Faceting on _id field


sujoysett
Hi,

Sorry if this is too obvious a question, but is a terms facet possible on the _id field of an index?

The reason I am trying to facet an already unique field is this:
I want to find the documents that are in one index but not in another.
That is, Docs(index2) is a subset of Docs(index1).
And I want to find Docs(index1)  MINUS Docs(index2).

I can do this by running a terms facet across both indices at once, ordered by reverse_count, on any unique field that exists in both indices; the terms that come back with count 1 are my result. I currently do this by also indexing the _id as a regular field in the _source of each document, but faceting directly on _id would be easier.

Is it possible?

Thanks in advance,
Sujoy.
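
To make the counting idea above concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: the two ID lists and the function name `ids_only_in_first` are made up for illustration, and it assumes, as in the post, that the second index is a subset of the first and that IDs are unique within each index.

```python
from collections import Counter

def ids_only_in_first(ids_index1, ids_index2):
    """Return IDs present in index1 but not in index2, by counting occurrences."""
    counts = Counter(ids_index1)
    counts.update(ids_index2)
    # index2 is assumed to be a subset of index1, so an ID seen exactly once
    # can only have come from index1.
    return sorted(doc_id for doc_id, n in counts.items() if n == 1)

# Hypothetical sample data standing in for the unique field of each index.
index1_ids = ["a", "b", "c", "d"]
index2_ids = ["b", "d"]
missing = ids_only_in_first(index1_ids, index2_ids)
print(missing)
```

A facet ordered by reverse_count does the same counting on the server and lists the count-1 terms first.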

Re: Faceting on _id field

sujoysett

Or is there a simpler approach to my basic objective (identifying documents that are in one index but not in another)?

Thanks,
Sujoy.

Re: Faceting on _id field

Clinton Gormley-2
In reply to this post by sujoysett
Hi Sujoy

> Sorry if I am asking a too obvious question, but is term facet
> possible on the _id field of an index?

It is possible, but not by default.  You would have to reindex your
indices and map the _id field to { "store": "yes" }

> I can do this by running a facet query simultaneously on both indices
> with reverse_count on any unique field belonging to both the indices,
> and the responses with count 1 are my result. I am currently doing
> this by indexing the _id also as a field in the _source of the
> documents, but the easier way would be a facet on _id.

That's rather a nice approach.  One warning though: your IDs are unique
values, which means that you have to load a LOT of unique terms to facet
on the _id field.  You may well run out of memory in the future, when
you try the same thing with millions of docs.

clint



Re: Faceting on _id field

sujoysett

Thanks a lot Clint.

I will surely try that. I presume there is no way other than re-indexing to change the mapping of _id field to {"store": "yes"} for already existing docs? Currently nothing is mapped as such, so by default it is probably not stored.

Regards,
Sujoy.

Re: Faceting on _id field

Clinton Gormley-2

> I will surely try that. I presume there is no way other than
> re-indexing to change the mapping of _id field to {"store": "yes"} for
> already existing docs? Currently nothing is mapped as such, so by
> default it is probably not stored.

Correct. You have to reindex

clint

Re: Faceting on _id field

sujoysett
Another strange problem, following on from the above. I am running a terms facet on a field that stores unique values in both indexes.

I am querying like this:

http://localhost:9200/index1,index2/_search
{
    "from": 0,
    "size": 0,
    "query": {
        "match_all": {}
    },
    "facets": {
        "temporaryFacetName": {
            "terms": {
                "field": "fieldName",
                "order": "reverse_count",
                "size": 100
            }
        }
    }
}

But the reverse_count ordering does not work correctly, and neither does using count in its place (I tried that too, just as a wild guess).

I am assuming that the ordering is done on each index separately and the results are merged afterwards, instead of the results from both indexes being merged first and the ordering applied to the combined results.

The query above gives the expected result up to a certain point in time, after which the ordering goes wrong.

Any help?

Thanks in advance,
Sujoy.

Re: Faceting on _id field

Clinton Gormley-2

>
> I am assuming that ordering is getting done first on both the indexes
> separately, and then merging done, instead of merging results from
> both index first and ordering on combined results.

Correct.

https://github.com/elasticsearch/elasticsearch/issues/1305


>
> The query I stated above is giving expected result upto certain point
> of time after which the ordering goes wrong.

The only way around it currently is to ask for many more terms than you
actually need... which will also use more RAM.

clint
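
The effect discussed above can be reproduced with a small simulation. This is only a simplified model in plain Python, assuming each index returns its own lowest-count terms before the lists are merged; it is not Elasticsearch's actual implementation, and the names `top_n_ascending`, `merged_facet` and `shard_size` are invented for the sketch.

```python
from collections import Counter

def top_n_ascending(counts, n):
    """Return the n (term, count) pairs with the lowest counts, ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]

def merged_facet(per_index_counts, request_size, shard_size):
    """Each index reports only its own top shard_size terms; the partial
    lists are summed and re-sorted to produce the final answer."""
    merged = Counter()
    for counts in per_index_counts:
        for term, count in top_n_ascending(counts, shard_size):
            merged[term] += count
    return top_n_ascending(merged, request_size)

# Hypothetical data: index2 is a subset of index1, every value is unique per index.
index1 = {"a": 1, "b": 1, "c": 1, "d": 1}
index2 = {"a": 1, "b": 1}

# Truncated per-index lists: "c" and "d" never reach the merge step.
truncated = merged_facet([index1, index2], request_size=2, shard_size=2)
# Asking each index for more terms than needed lets the merge see everything.
complete = merged_facet([index1, index2], request_size=2, shard_size=4)
print(truncated)
print(complete)
```

In this model the truncated run reports only terms with count 2, while the larger per-index size surfaces the count-1 terms, at the cost of holding more terms in memory.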

Re: Faceting on _id field

sujoysett

Hi Clint,

Thanks again.
Too bad ... It seems I have to fall back on good old scroll once again.

Regards,
Sujoy.

Re: Faceting on _id field

sujoysett
Hi All ....

I have two rivers working simultaneously: the twitter river fetches data and indexes it into index A, and a custom river fetches data from index A, does the necessary processing and indexes it into index B.

At a given instant, let's say index A has x docs and index B has y docs, and I want to fetch those (x-y) docs: the ones that are in index A and not in index B. The document counts are in the millions.

I have tried using scroll to fetch the ids from destination index B and matching them against a scroll of source index A, but it is quite time consuming. Is there any better way to achieve this?
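
For reference, the scroll-and-match approach described above might look roughly like the following sketch. It is plain Python with a stand-in generator instead of a real scroll; `fake_scroll`, `missing_from_target` and the sample IDs are hypothetical, and it assumes the IDs of index B fit in memory as a set.

```python
def missing_from_target(source_ids, target_ids):
    """Yield IDs from source_ids that never appear in target_ids."""
    seen = set(target_ids)        # one full pass over index B
    for doc_id in source_ids:     # one streaming pass over index A
        if doc_id not in seen:
            yield doc_id

def fake_scroll(ids, page_size):
    """Hypothetical stand-in for a scroll: yields IDs one page at a time."""
    for start in range(0, len(ids), page_size):
        for doc_id in ids[start:start + page_size]:
            yield doc_id

index_a = ["t1", "t2", "t3", "t4", "t5"]
index_b = ["t2", "t4"]
todo = list(missing_from_target(fake_scroll(index_a, 2), fake_scroll(index_b, 2)))
print(todo)
```

Both indexes are still read in full, which is why this gets slow at millions of documents.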

Thanks.
Sujoy.

Re: Faceting on _id field

Igor Motov-3
The first thing that comes to mind is to add a timestamp field to the index A records and retrieve only the records that were added or modified after the last synchronisation time.
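
A minimal sketch of that idea, assuming the documents carry a date field (called `timestamp` here purely for illustration) and that the last synchronisation time is stored somewhere by the client. The helper `changed_since_query` is hypothetical; it only builds a search body with a standard range query and does not talk to a cluster.

```python
import json

def changed_since_query(last_sync_iso, field="timestamp"):
    """Build a search body matching docs whose date field is newer than last_sync_iso."""
    return {
        "query": {
            "range": {
                # "timestamp" is an assumed field name, not one from the thread.
                field: {"gt": last_sync_iso}
            }
        }
    }

# Hypothetical last synchronisation time.
body = changed_since_query("2012-10-22T20:00:00")
print(json.dumps(body, indent=4))
```

The resulting body could be sent to the _search endpoint of index A, so that only documents newer than the last run need to be scrolled.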

Re: Faceting on _id field

sujoysett
Thanks Igor, that can be quite useful. In fact, I had thought of assigning IDs to the docs in sequential rather than random order, as in a DB, to achieve the same thing. My first river is the twitter river, and adding a timestamp means editing its code for the mapping.
It's really not a problem to modify the river, but just out of curiosity, doesn't elasticsearch have any built-in logic to find the difference between two indices?

Thanks,
-- Sujoy.

Re: Faceting on _id field

Igor Motov-3
No, elasticsearch cannot calculate difference between two indices. Actually, I cannot think of any operation (except union) that elasticsearch can perform with two or more indices. 

You don't have to modify twitter river code. You can simply create a new index with desired mapping before creating twitter river.

On Monday, October 22, 2012 9:48:41 PM UTC-4, Sujoy Sett wrote:
Thanks Igor, that can be quite useful. In fact, I had thought of assigning ID to all the docs not in random, but sequential order as in DB to achieve the same. My first river id the twitter river, and adding timestamp means editing its code for mapping.
It's really not a problem to modify the river, but just curious, doesn't elasticsearch has any such logic to find difference in place in itself?

Thanks,
-- Sujoy.

On Tuesday, October 23, 2012 1:40:10 AM UTC+5:30, Igor Motov wrote:
The first thing that comes to mind is to add a timestamp field to the index A records and retrieved only records that were added/modified after the last synchronisation time. 

On Monday, October 22, 2012 4:02:54 PM UTC-4, Sujoy Sett wrote:
Hi All ....

I have two rivers working simultaneously .... twitter river fetching data and indexing in index A , and another custom river fetching data from index A, doing necessary processing and indexing it to index B.

At a certain instant, lets say, index A has x docs, and index B has y docs, and now I want to fetch those (x-y) docs ..... those which are in index A and not in index B. The count of docs are in millions.

I have tried using scroll to fetch ids from destination index B and matching against scroll of source index A, but it is quite time consuming. Is there any better way to achieve this?

Thanks.
Sujoy.

On Monday, September 17, 2012 5:20:29 PM UTC+5:30, Sujoy Sett wrote:

Hi Clint,

Thanks again.
Too bad ... It seems I have to fall back on good old scroll once again.

Regards,
Sujoy.

On Monday, September 17, 2012 4:46:06 PM UTC+5:30, Clinton Gormley wrote:

>
> I am assuming that ordering is getting done first on both the indexes
> separately, and then merging done, instead of merging results from
> both index first and ordering on combined results.

Correct.

https://github.com/elasticsearch/elasticsearch/issues/1305


>
> The query I stated above is giving expected result upto certain point
> of time after which the ordering goes wrong.

The only way around it currently is to ask for many more terms that you
actually need... which will also use more RAM

clint

>
>
> Any help?
>
> Thanks in advance,
> Sujoy.
>
> On Monday, September 17, 2012 4:33:59 PM UTC+5:30, Clinton Gormley wrote:
>
>         > I will surely try that. I presume there is no way other than
>         > re-indexing to change the mapping of _id field to {"store": "yes"} for
>         > already existing docs? Currently nothing is mapped as such, so by
>         > default it is probably not stored.
>
>         Correct. You have to reindex
>
>         clint
>
>         > Regards,
>         > Sujoy.
>         >
>         > On Monday, September 17, 2012 2:02:56 PM UTC+5:30, Clinton Gormley wrote:
>         >         Hi Sujoy
>         >
>         >         > Sorry if I am asking a too obvious question, but is term facet
>         >         > possible on the _id field of an index?
>         >
>         >         It is possible, but not by default. You would have to reindex your
>         >         indices and map the _id field to { "store": "yes" }
>         >
>         >         > I can do this by running a facet query simultaneously on both indices
>         >         > with reverse_count on any unique field belonging to both the indices,
>         >         > and the responses with count 1 are my result. I am currently doing
>         >         > this by indexing the _id also as a field in the _source of the
>         >         > documents, but the easier way would be a facet on _id.
>         >
>         >         That's rather a nice approach. One warning though: your IDs are unique
>         >         values, which means that you have to load a LOT of unique terms to facet
>         >         on the _id field. You may well run out of memory in the future, when
>         >         you try the same thing with millions of docs.
>         >
>         >         clint



Re: Faceting on _id field

sujoysett
Hi All,

Found a nice solution to the problem stated here, hence reviving this lost thread. Just thought of sharing so that it might help someone.



Problem Statement:

We process huge sets of documents from one index (primary) and push the processed documents (with new fields, or with existing fields under a new mapping) into a new index (secondary).
We use scroll to retrieve the docs, but a problem arises when the scroll breaks because of a network failure or some other issue.
The problem we were trying to solve is finding, at any given point in time, the unprocessed docs via scroll, that is, the docs that exist in the primary index but not in the processed secondary index.
ES does not have a direct MINUS / EXCEPT query to find the difference between two indexes.
We were not comfortable overwriting the primary index itself with the processing changes, as we often needed mapping or analyzer changes.



Solution:

Instead of creating a secondary index, we create a secondary type in the same index, with its _parent mapping pointing to the primary type.
We put the necessary mapping changes in the secondary type, start a scroll on the primary type, and push the processed data into the secondary type via bulk insert, each doc as a child of the primary doc it was derived from.
In the scroll, we use the following query to get the docs which are in the primary type but not in the secondary type:
POST localhost:9200/# index/# primary_type/_search?search_type=scan
{
    "filter": {
        "not": {
            "has_child": {
                "type": "# secondary_type",
                "query": {
                    "filtered": {
                        "query": {
                            "match_all": {}
                        }
                    }
                }
            }
        }
    }
}
At any point in time, finding the difference between two ES data sets (something like the SQL MINUS or EXCEPT clause) is thus possible.
It also lets us reprocess selected documents, if required, by modifying the has_child filter accordingly.
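
A minimal sketch of the approach above in Python, assuming the API of the Elasticsearch release discussed in this thread (mapping types, _parent, the "not" filter). The function names, the index and type names in the usage lines, and the choice of reusing the parent's id as the child's _id are illustrative assumptions, not part of the original setup. The sketch only builds the request bodies; sending them is left to whatever HTTP client is in use.

```python
import json


def missing_children_query(child_type):
    """Search body matching parent docs that have no child of child_type.

    Mirrors the query posted above, with the child type as a parameter.
    """
    return {
        "filter": {
            "not": {
                "has_child": {
                    "type": child_type,
                    "query": {
                        "filtered": {
                            "query": {"match_all": {}}
                        }
                    },
                }
            }
        }
    }


def bulk_child_lines(index, child_type, processed_docs):
    """Build a bulk request body for processed docs.

    processed_docs is an iterable of (parent_id, doc) pairs. Each doc is
    indexed as a child of the primary doc it was derived from. Reusing
    parent_id as the child's _id is an assumption made for this sketch.
    """
    lines = []
    for parent_id, doc in processed_docs:
        action = {
            "index": {
                "_index": index,
                "_type": child_type,
                "_id": parent_id,
                "_parent": parent_id,
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical names, for illustration only.
query_body = missing_children_query("processed_type")
bulk_body = bulk_child_lines(
    "my_index", "processed_type", [("42", {"text": "processed text"})]
)
print(json.dumps(query_body, indent=4))
print(bulk_body)
```

Once a child exists for a primary doc, that doc drops out of the query results, so a broken scroll can simply be restarted with the same query.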


-- Sujoy.





On Tuesday, October 23, 2012 8:01:19 AM UTC+5:30, Igor Motov wrote:
No, elasticsearch cannot calculate the difference between two indices. Actually, I cannot think of any operation (except union) that elasticsearch can perform on two or more indices.

You don't have to modify the twitter river code. You can simply create a new index with the desired mapping before creating the twitter river.

On Monday, October 22, 2012 9:48:41 PM UTC-4, Sujoy Sett wrote:
Thanks Igor, that can be quite useful. In fact, I had thought of assigning IDs to all the docs not in random but in sequential order, as in a DB, to achieve the same. My first river is the twitter river, and adding a timestamp means editing its code for the mapping.
It's really not a problem to modify the river, but just curious: doesn't elasticsearch have any built-in logic to find such a difference itself?

Thanks,
-- Sujoy.

On Tuesday, October 23, 2012 1:40:10 AM UTC+5:30, Igor Motov wrote:
The first thing that comes to mind is to add a timestamp field to the index A records and retrieve only the records that were added/modified after the last synchronisation time.
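
The timestamp idea above could look roughly like this as a search body. This is only a sketch: the field name "indexed_at" and the sample timestamp are hypothetical, and it assumes the field is mapped as a date in index A.

```python
import json


def modified_since_body(timestamp_field, last_sync):
    """Search body matching docs whose timestamp is later than last_sync."""
    return {"query": {"range": {timestamp_field: {"gt": last_sync}}}}


# Hypothetical field name and last synchronisation time.
body = modified_since_body("indexed_at", "2012-10-22T20:00:00")
print(json.dumps(body, indent=2))
```

The caller would store the time of each successful synchronisation and pass it in on the next run.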

On Monday, October 22, 2012 4:02:54 PM UTC-4, Sujoy Sett wrote:
Hi All ....

I have two rivers working simultaneously: the twitter river fetching data and indexing it into index A, and a custom river fetching data from index A, doing the necessary processing, and indexing it into index B.

At a certain instant, let's say, index A has x docs and index B has y docs, and I want to fetch those (x-y) docs, that is, the ones which are in index A and not in index B. The doc counts are in the millions.

I have tried using scroll to fetch ids from the destination index B and matching them against a scroll of the source index A, but it is quite time-consuming. Is there any better way to achieve this?

Thanks.
Sujoy.
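
The scroll-against-scroll comparison described in the message above can be sketched as a single merge pass over two id streams. This is an illustration under a strong assumption: both streams must already be sorted in the same order (a scan-type scroll does not guarantee that), and the ids are compared as plain strings. The function name and sample ids are made up for the example.

```python
def ids_missing_from_b(ids_a, ids_b):
    """Yield ids present in sorted stream ids_a but absent from sorted stream ids_b.

    Both inputs may be lazy iterables, so neither id set has to fit in memory.
    """
    it_b = iter(ids_b)
    current_b = next(it_b, None)
    for id_a in ids_a:
        # Skip ids in B that sort before the current id from A.
        while current_b is not None and current_b < id_a:
            current_b = next(it_b, None)
        # No equal id left in B means this doc has not been processed yet.
        if current_b is None or current_b != id_a:
            yield id_a


# Hypothetical ids standing in for the two scrolls.
source_ids = ["1", "2", "3", "5"]
processed_ids = ["2", "5"]
print(list(ids_missing_from_b(source_ids, processed_ids)))
```

Each stream is read only once, so the cost grows with x + y rather than x * y.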

On Monday, September 17, 2012 5:20:29 PM UTC+5:30, Sujoy Sett wrote:

Hi Clint,

Thanks again.
Too bad ... It seems I have to fall back on good old scroll once again.

Regards,
Sujoy.

On Monday, September 17, 2012 4:46:06 PM UTC+5:30, Clinton Gormley wrote:

>
> I am assuming that ordering is done first on each index separately
> and the results are then merged, instead of merging the results from
> both indexes first and ordering the combined results.

Correct.

https://github.com/elasticsearch/elasticsearch/issues/1305


>
> The query I stated above gives the expected result up to a certain
> point in time, after which the ordering goes wrong.

The only way around it currently is to ask for many more terms than you
actually need... which will also use more RAM

clint

>
>
> Any help?
>
> Thanks in advance,
> Sujoy.
>


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Faceting on _id field

jagdeep
It's awesome, Sujoy. We have been struggling with this for a long time, and now it will strengthen our idea of an ES-based processing engine.
Cheers to Elasticsearch!!!

Regards
Jagdeep

On Wednesday, May 15, 2013 4:24:45 PM UTC+5:30, Sujoy Sett wrote:
Hi All,

Found a nice solution to the problem stated here, hence reviving this lost thread. Just thought of sharing so that it might help someone.




