Distributed unique constraints

Steff
Hi

Is there some way of enforcing unique constraints on documents in ES?
E.g. telling ES that at most one document is allowed in an index
where fields "key1" and "key2" have the same values. E.g. given the
structure of the data in the example on
http://www.elasticsearch.org/guide/reference/api/index_.html, can I
somehow tell ES that if a document with "user=kimchy" and
"post_date=2009-11-15T14:12:12" has already been indexed into the
index, then no other documents with the exact same values for user and
post_date are allowed to be indexed?

It would be nice to have such a unique constraint feature across an index,
but to avoid communication overhead among nodes the feature might only
work within a single shard of the specific index (then it is up to the
application using ES to make sure that documents that might collide with
respect to the unique constraint are routed to the same shard). Is there
any support for unique constraints - at index level or at least at
shard level?

Is the "id" of documents at least subject to a unique constraint?
Or are you allowed to have several documents in the same index (or shard)
with the same id value?

Regards, Per Steffensen

Re: Distributed unique constraints

kimchy
Administrator
There is no support for unique constraints (and there probably won't be, because of both the limitations of a distributed system and non-real-time search). On the other hand, you can't have several docs with the same _id, and you can index a document with the "create" op_type, which causes the indexing to fail if a document is already indexed under the same _id.
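
To make the difference between the default behaviour and the "create" op_type concrete, here is a minimal in-memory sketch. It is a toy model, not Elasticsearch code: `ToyIndex`, `DuplicateIdError` and the return strings are made-up names for illustration only.

```python
class DuplicateIdError(Exception):
    """Raised by the toy store when a "create" hits an existing _id."""


class ToyIndex:
    """In-memory stand-in for one index; hypothetical, NOT Elasticsearch code."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, source, op_type="index"):
        exists = doc_id in self._docs
        # "create" refuses to touch an _id that is already taken.
        if op_type == "create" and exists:
            raise DuplicateIdError(doc_id)
        # The default op_type silently replaces the existing document.
        self._docs[doc_id] = source
        return "updated" if exists else "created"

    def get(self, doc_id):
        return self._docs[doc_id]


idx = ToyIndex()
first = idx.index("1", {"user": "kimchy"}, op_type="create")
second = idx.index("1", {"user": "someone_else"})  # default op_type overwrites

try:
    idx.index("1", {"user": "third"}, op_type="create")
    rejected = False
except DuplicateIdError:
    rejected = True
```

In this model the second call replaces the document, while the third call is refused because the _id is already in use.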

On Mon, Sep 12, 2011 at 10:39 AM, Per Steffensen wrote:
> Is there some way of enforcing unique constraints on documents in ES?

Re: Distributed unique constraints

Steff
Shay Banon wrote:
> You can't have several docs with the same _id on the other hand, and you can actually index a document with a "create" op_type, which will cause the indexing to fail if there is already a document indexed under the same _id.

Thanks, Shay. Then in my world there IS unique constraint support - but only on the _id field (no user-defined unique constraints). Can you say something about the "scope" of that unique constraint on _id - is it per index or only per shard in the index? If it is per index, I guess the feature will actually be a scalability limit, not allowing ES to scale "to infinity" (but probably very, very far) with respect to the number of nodes involved in serving a specific index. But maybe not. Can you say a little more about how it is implemented, with respect to the communication needed among nodes running shards in the index, in order to maintain the unique constraint on _ids across the entire index?

You say that I need to add a "create" op_type in order to make the index operation fail if it violates the unique constraint on _id. I would expect the index operation to fail anyway - what other possible outcome is there when an index operation violates the _id unique constraint? What happens if I try to index a new document with an _id that is already used by an existing document in the index, and I do not add the "create" op_type that you mention?


Re: Distributed unique constraints

Benjamin Devèze
The _id is unique on a per-type basis. So if you have an index twitter with two types in it, tweet1 and tweet2, you can use the same _id for a document of type tweet1 and a document of type tweet2.

If you send a request to index a document with an existing _id, this will update the doc.

Re: Distributed unique constraints

kimchy
Administrator
Adding to Benjamin's response, inline:

On Mon, Sep 12, 2011 at 11:52 AM, Per Steffensen wrote:
> Can you say something about the "scope" of that unique constraint on _id - is it per index or only per shard in the index?

A document's unique id is the tuple of its type and id. Since a document can't exist in two shards at the same time, the scope is index-wide, but the check is done per shard.
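
A rough sketch of why a per-shard check can give an index-wide guarantee: when the target shard is a pure function of the id, the same id always lands on the same shard, so only that shard ever needs to be asked. The `shard_for` helper below is hypothetical, and `zlib.crc32` is only a stand-in for whatever hash ES really uses.

```python
import zlib


def shard_for(doc_id, num_shards, routing=None):
    """Pick a shard from the routing value, which defaults to the id.

    Toy model only: crc32 is an assumed stand-in hash, not the ES one.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


NUM_SHARDS = 5
first_lookup = shard_for("1234", NUM_SHARDS)
second_lookup = shard_for("1234", NUM_SHARDS)
# Same id, no custom routing: both lookups name the same shard, so a
# duplicate check on that one shard covers the whole index.
```

No cross-node communication is needed in this model, because no other shard could ever hold that id.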
 

> What happens if I try to index a new document with an _id that is already used by an existing document in the index, and I do not add the "create" op_type?

The document is updated.
 



Re: Distributed unique constraints

Steff
Shay Banon wrote:
> A document's unique id is the tuple of its type and id. Since a document can't exist in two shards at the same time, the scope is index-wide, but the check is done per shard.

But for that to work, will that not require me, as a user of ES (writing applications using ES), to make sure that documents with the same type/_id are routed to the same shard? Imagine the situation where a document with type/_id equal to tweet1/1234 has already been indexed with a routing value making it go to shard1. Now my app tries to index a new document with the same type/_id values tweet1/1234, but with a different routing value making it go to shard2. In order to make sure that the unique constraint on type/_id is not violated, ES needs to ask all shards (especially shard1) whether they already contain a document with type/_id equal to tweet1/1234 - it is not enough to just ask the target (shard2) of the new document. Or didn't I understand routing correctly? So basically, because type/_id does not uniquely determine the shard that gets to index the document, all shards need to be contacted when a new document is indexed, in order to make sure it does not violate the unique constraint on type/_id.

> Updating the document.

Ok, thanks!
 



Re: Distributed unique constraints

kimchy
Administrator
If you use a custom routing value, then yes, you have to make sure you use that same routing value when you want to update the document.
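
The pitfall can be reproduced with a small in-memory model. Everything here is assumed for illustration: `ToyShardedIndex` is a made-up class, and the byte-sum hash is deliberately simplistic so the outcome is easy to follow. It is not how ES hashes routing values.

```python
class ToyShardedIndex:
    """Toy model of an index whose duplicate check is shard-local."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, doc_id, routing):
        # Routing defaults to the id; byte-sum is an assumed stand-in hash.
        key = routing if routing is not None else doc_id
        return sum(key.encode("utf-8")) % len(self.shards)

    def index(self, doc_id, source, routing=None, op_type="index"):
        shard = self.shards[self._shard_for(doc_id, routing)]
        # Only the target shard is consulted, never the others.
        if op_type == "create" and doc_id in shard:
            raise ValueError(f"document {doc_id!r} already exists on this shard")
        shard[doc_id] = source

    def copies(self, doc_id):
        return sum(1 for shard in self.shards if doc_id in shard)


# Different routing values send the same id to different shards, so even
# "create" does not notice the existing copy.
mixed = ToyShardedIndex(num_shards=2)
mixed.index("1234", {"v": 1}, routing="user_a", op_type="create")
mixed.index("1234", {"v": 2}, routing="user_b", op_type="create")
duplicates = mixed.copies("1234")

# Reusing the same routing value hits the same shard and updates in place.
consistent = ToyShardedIndex(num_shards=2)
consistent.index("1234", {"v": 1}, routing="user_a")
consistent.index("1234", {"v": 2}, routing="user_a")
single = consistent.copies("1234")
```

In this model the first index ends up holding two copies of id 1234 and the second holds one, which is why the routing value has to stay the same for a given document.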

On Mon, Sep 12, 2011 at 12:45 PM, Per Steffensen wrote:
> But for that to work, will that not require me, as a user of ES, to make sure that documents with the same type/_id are routed to the same shard?

Re: Distributed unique constraints

Steff
Shay Banon wrote:
> If you use a custom routing value, then you have to make sure you use that routing value when you want to update the document, yes.

Thanks. I would state that very clearly in the documentation about routing.


Re: Distributed unique constraints

onejigtwo
Would it thus make sense to generate an id for the document, composed of its field values, that would uniquely identify it among other documents? For example, the id of a document that defines a location could be a string concatenation, or an encoded concatenation, of (1) the country, (2) the city, (3) the longitude and latitude, and perhaps (4) a unique name. Would this be considered bad practice?
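
One possible way to build such an id, sketched under assumptions: the `composite_id` helper is hypothetical, and hashing the parts (rather than using the raw concatenation) is just one design choice that keeps the id at a fixed length. The example fields are the user and post_date from the first post in this thread.

```python
import hashlib


def composite_id(*parts):
    """Derive a deterministic id from the fields that must be unique together.

    Each part is length-prefixed so that ("ab", "c") and ("a", "bc")
    do not collapse into the same string before hashing.
    """
    encoded = "".join(f"{len(p)}:{p}" for p in map(str, parts))
    return hashlib.sha1(encoded.encode("utf-8")).hexdigest()


doc_id = composite_id("kimchy", "2009-11-15T14:12:12")
same_again = composite_id("kimchy", "2009-11-15T14:12:12")
other = composite_id("kimchy", "2009-11-15T14:12:13")
```

Indexing with an id like this and the "create" op_type would make a second document with the same field values fail rather than overwrite the first. Values such as longitude and latitude would presumably need to be normalized to a fixed textual form first, otherwise equal locations could produce different ids.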