Do not know how to call it but probably it is a new (and cool!) feature request?


Lukáš Vlček
Hi,

I am wondering whether Elasticsearch can support the following scenario out of the box and, if not, whether a new feature could be implemented to support it.

In my case I have various document types (mails, blogs, IRC logs, etc.). Each document has an author, but in reality an author (for example "Lukas Vlcek") can use various nicks across the whole document corpus: for example, he uses the nick "Lukas" for mails and the nicks "luk1" and "luk2" for IRC logs. I would like to provide name-consolidated filtering and querying via the search UI.

In other words, if a user selects author: "Lukas Vlcek", then
1) search results would contain mail results for "Lukas" and IRC log results for both "luk1" and "luk2"
2) facets would aggregate "Lukas", "luk1" and "luk2" under a single "Lukas Vlcek" item

As of now it would probably be possible to work around this somehow, but not very generally, and it would probably require reindexing with every change in the nicks of a particular user. I also do not think this is search grouping (it is not about deduplication; it is more like synonyms, but without the need to reindex). And AFAIK parent/child has somewhat limited query capabilities (think of complex facets and filtering, custom scoring and things like that). Nested types would require expensive reindexing.


Maybe my idea is too naive, but I think it shouldn't be that hard to have direct support for something like this in ES, given that the set of "nicks" data is not too large.

1) Let's have a separate index that defines the author-nicks relation. It would contain documents like { author: "Lukas Vlcek", nicks: [ "Lukas", "luk1", "luk2" ]}
2) Have this index automatically replicated to all nodes (or at least to those nodes that hold shards with data that needs to be queried for the searches described above)
3) Then, when doing a search, ES could expand the author field values (something like real-time synonyms) and also use the relation for facet aggregations (this could be the expensive part, depending on the size of the data).

As a result I could keep the author-nicks relation in a separate (hopefully small) index that could be updated at any time, and search requests would take it into account (in "real-time" fashion), yielding aggregated facets (nicks would be mapped to the author name) and search results (where individual search hits would provide both the original nick and the corresponding author name). Is that doable?
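
To make the proposed relation concrete, here is a minimal Python sketch of the two lookups it implies, done outside of ES. It is not an existing ES feature; the second author, the hit fields `nick` and `author_name`, and the function names are assumptions made up for illustration.

```python
# Hypothetical contents of the small author-nicks index (shape taken from the post).
AUTHOR_DOCS = [
    {"author": "Lukas Vlcek", "nicks": ["Lukas", "luk1", "luk2"]},
    {"author": "Jane Doe", "nicks": ["jane", "jd"]},  # made-up second author
]


def build_lookups(author_docs):
    """Return (author -> nicks, nick -> author) dictionaries."""
    nicks_by_author = {}
    author_by_nick = {}
    for doc in author_docs:
        nicks_by_author[doc["author"]] = list(doc["nicks"])
        for nick in doc["nicks"]:
            author_by_nick[nick] = doc["author"]
    return nicks_by_author, author_by_nick


def annotate_hits(hits, author_by_nick):
    """Add the consolidated author name to each hit, keeping the original nick."""
    return [
        {**hit, "author_name": author_by_nick.get(hit["nick"], hit["nick"])}
        for hit in hits
    ]


nicks_by_author, author_by_nick = build_lookups(AUTHOR_DOCS)
hits = [
    {"type": "mail", "nick": "Lukas"},
    {"type": "irc", "nick": "luk2"},
    {"type": "irc", "nick": "stranger"},  # unknown nick falls back to itself
]
annotated = annotate_hits(hits, author_by_nick)
```

Because both dictionaries are rebuilt from the small index, changing a nick assignment only means rebuilding the lookups, not touching the indexed documents.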

Comments/suggestions welcome.

Regards,
Lukas



Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Karussell
Hi Lukas,

Yes, this would be a nice feature. And yes, this is even doable at the
moment, but not as optimally as what you suggested (server-side fetching of
the filter-alias index replicated to all nodes).

But you can do this from the client side: you could create and query
that alias index and then use a Terms-query/filter to fetch documents
with authorX OR authorY OR ...

http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter.html

http://www.elasticsearch.org/guide/reference/query-dsl/terms-query.html
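
A rough sketch of the client-side step, in Python: look up the nicks for the selected author and build a terms query body from them. The field name `author` and the helper name are assumptions; only the general `{"terms": {field: [values]}}` shape comes from the terms query documentation linked above.

```python
import json


def terms_query_for_author(author, nicks_by_author, field="author"):
    """Build a terms query body matching any known nick of the given author."""
    # Unknown authors fall back to searching for the name itself.
    nicks = nicks_by_author.get(author, [author])
    return {"query": {"terms": {field: sorted(set(nicks))}}}


# Hypothetical result of querying the alias index first.
nicks_by_author = {"Lukas Vlcek": ["Lukas", "luk1", "luk2"]}
body = terms_query_for_author("Lukas Vlcek", nicks_by_author)
print(json.dumps(body))
```

Note that a terms query is not analyzed, so this only matches if the nicks are indexed exactly as written (for example in a not-analyzed field).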

BTW: the index alias *filtering* feature should also be in your
toolbox and could even be used to solve part of this problem:

http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

Peter.

Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Lukáš Vlček
Peter,

Using a terms query/filter would not really help me with the facets portion, and that is important. I did not specifically mention this use case, but I want the author-nicks mapping to work even if the user does not select any author at all. For example, the user just searches for the "Lucene" token. Imagine that we search the Lucene mailing lists (dev, users, announcements, etc.) and besides the top-scoring documents I want to display a top authors facet. If one author uses several email addresses, I would like to consolidate the contributions from all of that user's email accounts under a single alias.
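
For illustration, the consolidation described here can be sketched in Python as a post-processing step over facet entries. The entry shape (`term`, `count`), the email addresses and the function name are assumptions for the example, not something ES does for you.

```python
from collections import Counter


def consolidate_facet(facet_terms, author_by_address):
    """Sum per-address facet counts under one consolidated author name."""
    counts = Counter()
    for entry in facet_terms:
        # Addresses without a known author keep their own bucket.
        author = author_by_address.get(entry["term"], entry["term"])
        counts[author] += entry["count"]
    return counts.most_common()


# Hypothetical mapping and facet entries for a "Lucene" search.
author_by_address = {
    "lukas@example.org": "Lukas Vlcek",
    "lvlcek@example.com": "Lukas Vlcek",
}
facet_terms = [
    {"term": "lukas@example.org", "count": 7},
    {"term": "someone@example.net", "count": 6},
    {"term": "lvlcek@example.com", "count": 5},
]
top_authors = consolidate_facet(facet_terms, author_by_address)
```

Merging on the client like this is only approximate: a terms facet returns just the top entries, so an author whose individual addresses each fall below the cutoff can be undercounted. That is one reason doing it inside ES would be attractive.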

Generally speaking, I think (well ... hope) that such functionality is doable in ES and I am sure people would find many exotic use cases for it, not just the one mentioned above. (In fact the above problem can be solved in many different ways, but if I could get it directly from ES that would be really cool and it would save me a lot of work and maintenance!)

Regards,
Lukas

On Thu, Dec 8, 2011 at 9:46 PM, Karussell <[hidden email]> wrote:
Hi Lukas,

Yes, this would be a nice feature. And yes, this is even doable at the
moment, but not as optimally as what you suggested (server-side fetching of
the filter-alias index replicated to all nodes).

But you can do this from the client side: you could create and query
that alias index and then use a Terms-query/filter to fetch documents
with authorX OR authorY OR ...

http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter.html

http://www.elasticsearch.org/guide/reference/query-dsl/terms-query.html

BTW: the index alias *filtering* feature should also be in your
toolbox and could even be used to solve part of this problem:

http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

Peter.


Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Michael Sick
Lukáš,

A very common use case for this type of feature would be indexing documents that were decorated after NLP Named Entity Recognition analysis. This is very popular for social network analysis, where information like identity, location, group, role, intent, ... is parsed out of documents and included as metadata. Typically these annotations are stored as pointers back to a point of reference, along with some type of confidence score.

If one could use ES to efficiently store and query these types of relationships, it would become an attractive sink for data coming out of systems like Apache UIMA or GATE. I plan to add some features to my current work that would benefit from this type of functionality, but not for 6-8 months.

I haven't considered it too much at this point, but I've always wanted something like HBase's coprocessor functionality for ES, where you can do pre/post-processing on inserts, updates, deletes, ...

--Mike
On Thu, Dec 8, 2011 at 4:37 PM, Lukáš Vlček <[hidden email]> wrote:
Peter,

Using a terms query/filter would not really help me with the facets portion, and that is important. I did not specifically mention this use case, but I want the author-nicks mapping to work even if the user does not select any author at all. For example, the user just searches for the "Lucene" token. Imagine that we search the Lucene mailing lists (dev, users, announcements, etc.) and besides the top-scoring documents I want to display a top authors facet. If one author uses several email addresses, I would like to consolidate the contributions from all of that user's email accounts under a single alias.

Generally speaking, I think (well ... hope) that such functionality is doable in ES and I am sure people would find many exotic use cases for it, not just the one mentioned above. (In fact the above problem can be solved in many different ways, but if I could get it directly from ES that would be really cool and it would save me a lot of work and maintenance!)

Regards,
Lukas


Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Ivan Brusic
In reply to this post by Lukáš Vlček
Your solution is similar to using a map-side join in Hadoop. There is
no point going out to HDFS for data if it can simply be stored in
memory, especially if the data is used by every machine in the
cluster.
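
As a rough illustration of the map-side join idea (plain Python, not Hadoop or ES code): the small table lives in memory on every worker, and each record is enriched as it streams past, with no remote lookup. All names and data here are made up for the example.

```python
def map_side_join(records, small_table, key):
    """Yield each record enriched from an in-memory table held by the worker."""
    for record in records:
        # Fall back to the raw key when the small table has no entry for it.
        author = small_table.get(record[key], record[key])
        yield {**record, "author": author}


# Small table: assumed to fit in memory on every node.
author_by_nick = {"Lukas": "Lukas Vlcek", "luk1": "Lukas Vlcek"}

# Large side: streamed record by record.
records = [
    {"nick": "luk1", "text": "hello"},
    {"nick": "guest", "text": "hi"},
]
joined = list(map_side_join(records, author_by_nick, "nick"))
```

The approach only works while the small side really is small, which is exactly the caveat about size.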

The caveat is of course what is considered large or not. Self-limiting
is not always the best solution since some would "exploit" the feature
and then bad mouth the product when it does not work.

Cheers,

Ivan

On Thu, Dec 8, 2011 at 2:33 AM, Lukáš Vlček <[hidden email]> wrote:
> Hi,
>
> Maybe my idea is too naive, but I think it shouldn't be that hard to have
> direct support for something like this in ES, given that the set of "nicks"
> data is not too large.
>

Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Lukáš Vlček
Hi Ivan,

If it were possible to have this functionality as an ES-independent plugin, then I would not worry that much about people bad-mouthing it. They could only blame the author of the plugin.

@Shay, do you think something like that is possible to implement as a plugin?

Regards,
Lukas

On Fri, Dec 9, 2011 at 1:16 AM, Ivan Brusic <[hidden email]> wrote:
Your solution is similar to using a map-side join in Hadoop. There is
no point going out to HDFS for data if it can simply be stored in
memory, especially if the data is used by every machine in the
cluster.

The caveat is of course what is considered large or not. Self-limiting
is not always the best solution since some would "exploit" the feature
and then bad mouth the product when it does not work.

Cheers,

Ivan


Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Karussell
In reply to this post by Lukáš Vlček
Hi Lukas,

see below

> For example, the user just searches for the "Lucene" token. Imagine
> that we search the Lucene mailing lists (dev, users, announcements, etc.)
> and besides the top-scoring documents I want to display a top authors facet.
> If one author uses several email addresses, I would like to
> consolidate the contributions from all of that user's email accounts
> under a single alias.

Ah, ok, this is indeed an additional requirement for the 'aliasing
index' I have in mind :)

But wouldn't it be somehow possible with a script while faceting?

http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet.html

Peter.

Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Karussell
In reply to this post by Lukáš Vlček
Why not index a "real-username" (or an ID) for all users, which is not
displayed, and only add the current alias as the displayable
name?

Or is this not possible?
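
A small Python sketch of what this could look like, under the assumption that every document stores a stable `author_id` next to the nick and that display names live in a separate table. The ids, field names and data are hypothetical.

```python
from collections import Counter

# Hypothetical id -> display name table, kept outside the indexed documents.
AUTHORS = {"u42": "Lukas Vlcek"}

# Each indexed document carries the stable id plus the nick it was written under.
DOCS = [
    {"author_id": "u42", "nick": "Lukas", "type": "mail"},
    {"author_id": "u42", "nick": "luk1", "type": "irc"},
    {"author_id": "u42", "nick": "luk2", "type": "irc"},
    {"author_id": "u7", "nick": "guest", "type": "irc"},
]


def facet_by_author(docs, authors):
    """Count documents per stable id, then resolve ids to display names."""
    counts = Counter(doc["author_id"] for doc in docs)
    return [(authors.get(aid, aid), n) for aid, n in counts.most_common()]


before = facet_by_author(DOCS, AUTHORS)
AUTHORS["u42"] = "Lukáš Vlček"  # renaming touches only the lookup table
after = facet_by_author(DOCS, AUTHORS)
```

With this layout a pure rename needs no reindexing, but moving a nick to a different `author_id` still means rewriting every document that carries it.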

On 9 Dec., 09:46, Lukáš Vlček <[hidden email]> wrote:

> Hi Ivan,
>
> If it were possible to have this functionality as an ES-independent
> plugin, then I would not worry that much about people bad-mouthing it.
> They could only blame the author of the plugin.
>
> @Shay, do you think something like that is possible to implement as a
> plugin?
>
> Regards,
> Lukas

Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Lukáš Vlček
That is definitely possible, but you still have to reindex all related documents whenever you need to change things (for example, when you learn that you assigned a given nick to the wrong user name, or if you want to change the name of the user). As I said, there are definitely many ways to approach my use case, but having some sort of out-of-the-box support in ES would be really great. Such functionality would open the door to other crazy experiments...

Regards,
Lukas

On Fri, Dec 9, 2011 at 11:08 AM, Karussell <[hidden email]> wrote:
Why not index a "real-username" (or an ID) for all users, which is not
displayed, and only add the current alias as the displayable
name?

Or is this not possible?


Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Lukáš Vlček
Maybe I should have said that as of now I do not have any "username - nicks" dataset. It will be built gradually over time, and I am looking for a way to avoid reindexing the data with every update or change in this relatively small "username-nicks" dataset.

On Fri, Dec 9, 2011 at 11:33 AM, Lukáš Vlček <[hidden email]> wrote:
That is definitely possible, but you still have to reindex all related documents whenever you need to change things (for example, when you learn that you assigned a given nick to the wrong user name, or if you want to change the name of the user). As I said, there are definitely many ways to approach my use case, but having some sort of out-of-the-box support in ES would be really great. Such functionality would open the door to other crazy experiments...

Regards,
Lukas



Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Karussell
Yes, please open one or even two issues :)

I think one that makes a more generic server-side refetching possible
via scripting or similar **

and then one issue attacking "external aliased query handling and
facet aggregation"

Peter.


**
query: {
     some normal term query selecting some docs, e.g. friends or nicks of a user

     doAfterQuery: myQueryScript
}

in myQueryScript the resulting hits are available and then one could
construct a terms query JSON from docs[i].nick

how to attack pagination? and could the generated query even have
another doAfterQuery part or return several queries?
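
The two-step flow sketched above can be imitated on the client today. The following Python toy shows the chaining only: `doAfterQuery` is a proposed name, not an existing ES option, and `search` here is a stand-in over in-memory lists that understands just `term` and `terms` on a single field.

```python
# Two hypothetical "indices" held as plain lists.
ALIASES = [
    {"author": "Lukas Vlcek", "nick": "luk1"},
    {"author": "Lukas Vlcek", "nick": "luk2"},
    {"author": "Jane Doe", "nick": "jd"},
]
MESSAGES = [
    {"nick": "luk1", "text": "hello"},
    {"nick": "other", "text": "hi"},
    {"nick": "luk2", "text": "bye"},
]


def search(docs, query):
    """Toy stand-in for a search call: supports only 'term' and 'terms'."""
    if "term" in query:
        field, value = next(iter(query["term"].items()))
        wanted = {value}
    else:
        field, values = next(iter(query["terms"].items()))
        wanted = set(values)
    return [doc for doc in docs if doc.get(field) in wanted]


def build_followup(first_hits):
    """Play the role of myQueryScript: turn the first hits into a terms query."""
    return {"terms": {"nick": [hit["nick"] for hit in first_hits]}}


first_hits = search(ALIASES, {"term": {"author": "Lukas Vlcek"}})
followup = build_followup(first_hits)
results = search(MESSAGES, followup)
```

Pagination is left open here as well: the sketch assumes the first query returns all alias hits, which is only reasonable while the alias data stays small.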

Re: Do not know how to call it but probably it is a new (and cool!) feature request?

Lukáš Vlček
Peter,

The doAfterQuery approach will not help with facets; by then it is too late.

What would be cool is some kind of integration with a distributed in-memory datastore that could be consulted at any phase of query execution and score calculation (not only after the query). I am sure Shay has already thought about this... but since such a feature is not available now, I am at least looking (asking) for some intermediate step :-)

Regards,
Lukas

On Fri, Dec 9, 2011 at 12:16 PM, Karussell <[hidden email]> wrote:
Yes, please open one or even two issues :)

I think one that makes a more generic server-side refetching possible
via scripting or similar **

and then one issue attacking "external aliased query handling and
facet aggregation"

Peter.


**
query: {
    some normal term query selecting some docs, e.g. friends or nicks of a user

    doAfterQuery: myQueryScript
}

in myQueryScript the resulting hits are available and then one could
construct a terms query JSON from docs[i].nick

how to attack pagination? and could the generated query even have
another doAfterQuery part or return several queries?