TTL for documents

Benjamin Devèze
A lot of documents naturally come with an expiration date. I think it
would be nice to have built-in support for a per-document TTL (possibly
with default TTLs configurable per type/index). I know disks are not
expensive these days, but using TTLs for documents is still common, and
it can be a very useful feature, especially for people using ES as a
key-value store. It is a pain to make the user regularly trigger
delete-by-query jobs to purge the data, and I think it is a common
enough use case to include in the core of ES.

As for the implementation, I propose introducing a special _ttl field.
When searching, ES could hide expired documents (maybe by adding a
"not expired" filter to each query, or something smarter). The documents
could then be physically deleted, and the disk space freed, during
segment merges.
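
To make the idea concrete, here is a minimal in-memory sketch in Python. It is not ES code. The `_expires_at` key (an absolute expiry time that would be derived from the _ttl at index time) and the `search` and `merge` helpers are assumptions made up for illustration. Expired documents are hidden at search time and only physically dropped in a later merge-like pass.

```python
import time


def is_expired(doc, now):
    """A document without an expiry time never expires."""
    expires_at = doc.get("_expires_at")  # hypothetical field name
    return expires_at is not None and expires_at <= now


def search(docs, predicate, now=None):
    """Return matching docs, hiding expired ones (the "not expired" filter)."""
    now = time.time() if now is None else now
    return [d for d in docs if not is_expired(d, now) and predicate(d)]


def merge(docs, now=None):
    """Stand-in for a segment merge: expired docs are physically dropped here."""
    now = time.time() if now is None else now
    return [d for d in docs if not is_expired(d, now)]


docs = [
    {"id": 1, "body": "session a", "_expires_at": 100.0},
    {"id": 2, "body": "session b", "_expires_at": 200.0},
    {"id": 3, "body": "permanent"},
]

hits = search(docs, lambda d: True, now=150.0)  # doc 1 is hidden
remaining = merge(docs, now=150.0)              # doc 1 is really gone
```

The point of the split is that hiding is cheap and immediate, while the actual deletion can wait for work that happens anyway.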

What do you think?

Re: TTL for documents

kimchy
Administrator
Heya,

  Yes, it's possible to add this feature. I think there is already an issue open for something similar... Would love to hear what other people think...

-shay.banon

Re: TTL for documents

Michel Conrad
I had been wondering why I couldn't send any more messages to the list ;-)
Well, anyway, here is what I wanted to post:


+1
I would love to see this feature implemented as I am doing something
similar on the client side and it would be much simpler if the server
could take care of the expiration date.

Best,
Michel

Re: TTL for documents

kimchy
Administrator
Yea, annoyed by it :). Btw, I wanted to post another important note on managing expiring data. Another way of doing it, assuming it applies to the use case, is to create an index per timespan, and then expire data by simply deleting old indices. The benefit of this usage pattern is that deleting an index is much faster and puts less strain on the system than deleting specific documents from an index (which then have to be merged out).

For example, you could index log data into a single index and give it a TTL of 2 weeks. A better solution would be to create an index per week and delete old indices once they pass the 2-week mark.
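
A rough sketch of the bookkeeping for the weekly scheme, in plain Python with no ES client involved. The `logs-YYYY-wNN` naming convention and both helper names are assumptions for illustration only. A weekly index counts as expired once even its newest possible document is older than the retention period.

```python
from datetime import date, timedelta


def weekly_index_name(day, prefix="logs"):
    """Name of the weekly index a document dated `day` would go into."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{prefix}-{iso_year}-w{iso_week:02d}"


def expired_indices(week_starts, today, keep=timedelta(weeks=2), prefix="logs"):
    """Names of weekly indices whose whole week lies before the cutoff.

    `week_starts` holds the Monday on which each existing index begins.
    """
    cutoff = today - keep
    return [
        weekly_index_name(start, prefix)
        for start in week_starts
        if start + timedelta(days=7) <= cutoff
    ]


today = date(2011, 7, 27)
existing = [date(2011, 6, 27), date(2011, 7, 4), date(2011, 7, 11),
            date(2011, 7, 18), date(2011, 7, 25)]

current = weekly_index_name(today)            # index to write into today
to_delete = expired_indices(existing, today)  # indices safe to drop whole
```

An external job would then delete each index in `to_delete` in one operation instead of deleting documents one by one.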

-shay.banon

Re: TTL for documents

Michel Conrad
That's exactly how I'm doing it at the moment. Still, I think in some
cases it would be convenient to be able to specify a TTL while
indexing, for instance if your expiring data is irregularly and
sparsely distributed over the time range. In that case it would be
difficult to choose the time range of the different indices, and if you
want to, say, keep docs up to a maximum age of 6 months, I think it
would be nice to specify the TTL for the docs while indexing, instead
of periodically iterating over the results and cleaning up manually.

Re: TTL for documents

Benjamin Devèze
In reply to this post by kimchy
Yeah I agree on the index per timespan approach, it is an efficient way to handle expiring data for logs and things like that but it doesn't really fit my use cases:

- the user still has to manage deletion of expired indices with external jobs, which is of course easy but not really nice
- it is not really flexible because you have to choose your time range a priori. In my use case I would like to be able to dynamically change the TTL of an indexed doc, and I don't want to have to delete it from one index and reindex it into another one fitting the new TTL
- it can lead to a lot of indices (think, for example, of one index per user, with each index further divided by time range...), which adds overhead



Re: TTL for documents

Benjamin Devèze
In reply to this post by kimchy
If other people are interested, and if we can agree here on a good way to do it, I am quite willing to implement it. Kimchy, do you have any particular recommendations or concerns about the implementation?

Re: TTL for documents

kimchy
Administrator
I think we identified two different implementations:

1. The first is one that I have been thinking about for a long time: automatic rolling of indices. Basically, utilizing the index templates notion, one can define an index rolling strategy (time-based can be the first one). When indexing, we can check if a rollover is needed, and if so, create a new index and index the data into it. The fact that it's built on top of index templates means it will automatically support custom settings and mappings for the indices created.

   This one can touch on several places in elasticsearch, and can have the additional features:
    - Automatic index naming based on rollover strategy (week / day / ...)
    - Automatically delete old indices, where old is defined in the rollover strategy.
    - Automatic setting of aliases. For example, an "indexing" alias and "search" alias, as well as possible additional search aliases ("last_week", "last_month").

2. TTL per document in the index. That one is a bit more tricky, as it requires thinking about where the TTL will be stored. It can be stored in the document, but then it requires reindexing whenever it changes. It will also require a process that periodically evicts old documents.
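
One way to picture the second option is a sketch that keeps expiry times in a side table instead of in the documents, so a TTL change does not touch the document itself. This is only an illustration in Python. The `TtlTable` class and its methods are hypothetical and say nothing about how elasticsearch would actually store the TTL.

```python
import heapq


class TtlTable:
    """Side table of expiry times, kept apart from the documents themselves."""

    def __init__(self):
        self._expiry = {}  # doc_id -> current expiry timestamp
        self._heap = []    # (expiry, doc_id) pairs; may contain stale entries

    def set_ttl(self, doc_id, ttl_seconds, now):
        """Set or change a TTL without reindexing the document."""
        expires_at = now + ttl_seconds
        self._expiry[doc_id] = expires_at
        heapq.heappush(self._heap, (expires_at, doc_id))

    def evict_expired(self, now):
        """Return ids whose TTL has elapsed; meant to be called periodically."""
        evicted = []
        while self._heap and self._heap[0][0] <= now:
            expires_at, doc_id = heapq.heappop(self._heap)
            # Skip heap entries superseded by a later set_ttl call.
            if self._expiry.get(doc_id) == expires_at:
                del self._expiry[doc_id]
                evicted.append(doc_id)
        return evicted


table = TtlTable()
table.set_ttl("a", 60, now=0)
table.set_ttl("b", 60, now=0)
table.set_ttl("b", 600, now=10)            # extend b without touching the doc

first = table.evict_expired(now=100)       # only "a" has expired
second = table.evict_expired(now=1000)     # now "b" has expired too
```

The periodic process would call `evict_expired` on a timer and delete the returned ids from the index. Those deletes still have to be merged out, which is the cost mentioned earlier in the thread.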

Re: TTL for documents

Mahendra M
In reply to this post by kimchy
(Sending to new mailing list)

+1 to this feature. It will help in my scenario.

I have docs in CouchDB which have publication and expiry timestamps in them.
I expose ElasticSearch as a query layer to users for these docs.
I have a celery (python) job which keeps syncing (POST/DELETE) these docs to ElasticSearch at appropriate times.

Typically there is a 2 - 5 minute delay in these operations (publish_time + x_minutes). It is OK for me to publish the doc to ElasticSearch with a delay, but withdrawing with a delay is a bit painful.

If a _ttl field is supported, it will make withdrawing docs easier and almost realtime.
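
As a small illustration of what the sync job would then do, and assuming (hypothetically) that such a _ttl were given as a duration in milliseconds at publish time, the job would only need to turn the stored expiry timestamp into a remaining lifetime. The helper name below is made up.

```python
from datetime import datetime, timezone


def remaining_ttl_ms(expires_at, now):
    """Milliseconds until expiry, or None if the doc has already expired."""
    ms = int((expires_at - now).total_seconds() * 1000)
    return ms if ms > 0 else None


now = datetime(2011, 7, 27, 12, 0, 0, tzinfo=timezone.utc)
expiry = datetime(2011, 7, 27, 12, 5, 0, tzinfo=timezone.utc)

ttl = remaining_ttl_ms(expiry, now)    # five minutes, in milliseconds
stale = remaining_ttl_ms(now, expiry)  # already past expiry, so not published
```

With that, only the publish step runs on the job's schedule, and withdrawal no longer depends on the job's delay.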

Regards,
Mahendra


--
Mahendra

http://twitter.com/mahendra