Confused about Segments: Searchable vs Committed vs Uncommitted

Zachary Tong

I'll preface this question by saying it is purely academic.  I thought these terms meant one thing, but upon watching a live index I'm no longer sure.  Are the following definitions correct?

  • Searchable: Segment is on disk as a Lucene segment and is marked as searchable.  A segment is marked searchable by the periodic refresh_interval or by the Refresh API.
  • Committed: Segment is on disk as a Lucene segment, but has not been marked as searchable yet. A segment is committed to disk when the translog defaults are reached (5000 ops, 200 MB, or 30 min, whichever comes first), or by the Flush API.
  • Uncommitted: Segment (or operations?) lives only in the translog and has not been written to a Lucene segment yet.
With those definitions in mind, I started looking at a live index (default settings) and was surprised to see something like this:

I verified these graphs with the raw Segments API, to make sure it wasn't my plugin that was being odd.  The presence of very large Uncommitted segments (1 million docs) that are very long-lived (_130 at the far left was very old and persistent) confuses me.  Ditto for Committed segments...shouldn't those be changed to search:true every second under default settings?

I created a simpler index with 1 shard, 0 replicas and repeated the experiment:

The results are similar, where there are relatively large segments (>5000 translog limit) that remain uncommitted.  This is under heavy indexing load from a JMeter benchmark.  If I stop the indexing, this index will eventually switch over to fully Searchable.  Even stranger, if I add a replica while continuing to index, the segments are all marked searchable as soon as the replica is initialized:



After seeing these graphs, I'm convinced I have no idea what a Searchable/Committed/Uncommitted segment actually is.  Could someone shed some light on where I'm misunderstanding?  Thanks!
-Zach

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: Confused about Segments: Searchable vs Committed vs Uncommitted

Zachary Tong
Hmm, perhaps I'm misunderstanding the results of the Segments API.  I assumed that:

search: true && committed: true == Searchable
search: false && committed: true == Committed
search: false && committed: false == Uncommitted
search: true && committed: false == Uncommitted (although I'm not sure this case ever happens...searchable but not committed?)
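
For reference, here is a minimal Python sketch of the mapping I'm assuming above. The function name `classify` is made up for illustration and is not part of the Segments API; it only encodes my reading of the two flags, which may well be wrong.

```python
def classify(search: bool, committed: bool) -> str:
    """Label a segment using the mapping assumed in this post (an assumption,
    not documented Elasticsearch behaviour)."""
    if search and committed:
        return "Searchable"
    if committed:
        # committed but not visible to searches
        return "Committed"
    # committed is False here, whatever the value of search
    return "Uncommitted"


print(classify(search=False, committed=True))  # Committed
```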

-Zach



On Thursday, February 14, 2013 9:12:55 AM UTC-5, Zachary Tong wrote:

[...]

Re: Confused about Segments: Searchable vs Committed vs Uncommitted

Zachary Tong
One-week-later bump to see if anyone can clear this up for me =)

Cheers,
-Zach



On Thursday, February 14, 2013 9:19:57 AM UTC-5, Zachary Tong wrote:
[...]

Re: Confused about Segments: Searchable vs Committed vs Uncommitted

simonw-2
hey,

I didn't go through all your graphs, but let me explain how this works at a lower level in Lucene. An index (Lucene index) consists of multiple segments. Segments are written by flushing RAM buffers to disk (a Lucene flush, not an ES flush) or by merge processes. When you commit a Lucene index you 1. flush everything to disk, 2. write a commit point (listing all segments belonging to this commit), and 3. call fsync. If you open a new IndexReader on this commit, all its segments are "searchable".
Now ES uses a feature called NRT (near real-time) that is similar to a commit in that it flushes to disk (ES refresh), but it neither fsyncs nor writes a commit point. You can open an NRT reader on top of an uncommitted index, so those segments can be searchable (not sure if I like this term). I think the "uncommitted" part corresponds to data not yet flushed into a segment at the Lucene level, which means it's still in memory and written to the translog.
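
To make the refresh vs. commit difference concrete, here is a toy Python model. It is only a sketch: `ToyShard` and all of its attributes are invented for illustration, nothing here is real Lucene or ES code, and it ignores merges, fsync and the translog entirely.

```python
class ToyShard:
    """Toy model of one Lucene index (hypothetical, for illustration only)."""

    def __init__(self):
        self.buffer = []       # docs indexed but not yet written to a segment
        self.segments = []     # names of all segments written so far
        self.reader = []       # segments visible to the current NRT reader
        self.last_commit = []  # segments listed in the last commit point
        self._next = 0

    def index(self, doc):
        self.buffer.append(doc)

    def _flush_buffer(self):
        # Lucene-level flush: turn the RAM buffer into a new segment.
        if self.buffer:
            self.segments.append(f"_{self._next}")
            self._next += 1
            self.buffer = []

    def refresh(self):
        # Modeled ES refresh: flush the buffer and reopen the NRT reader.
        # No commit point is written.
        self._flush_buffer()
        self.reader = list(self.segments)

    def commit(self):
        # Modeled Lucene commit: flush and record a commit point.
        # In this toy model the reader is left untouched.
        self._flush_buffer()
        self.last_commit = list(self.segments)


shard = ToyShard()
shard.index({"n": 1})
shard.refresh()
print(shard.reader, shard.last_commit)  # ['_0'] []
shard.index({"n": 2})
shard.commit()
print(shard.reader, shard.last_commit)  # ['_0'] ['_0', '_1']
```

After the refresh, `_0` is visible to the reader but not committed; after the commit, `_1` is committed but not yet visible to the reader, because the reader has not been reopened.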

hope this clarifies it a bit.

simon

On Wednesday, February 20, 2013 5:09:00 PM UTC+1, Zachary Tong wrote:
[...]

Re: Confused about Segments: Searchable vs Committed vs Uncommitted

Zachary Tong
Thanks for the help Simon!  I've spent all morning digging through the ES source code (segments method in RobinEngine.java) trying to fully wrap my head around what's going on.  Can you confirm that my logic is now correct?
  1. Foreach segment in NRT IndexReader (because SearcherManager was built with an IndexWriter instead of a Directory)
    • segment.search = true
    • Add to `segments`
  2. Foreach segment in lastCommittedSegmentInfos 
    • If not in `segments`
      • search = false
      • committed = true
      • Add to `segments`
    • Else
      • committed = true
      • (and search = true because it was set in step 1)

Step 1 finds all segments that exist in the NRT IndexReader and marks them {search:true}.  Step 2 finds all segments at the last commit point; if they are not in our segment list yet, it marks them {search:false, committed:true}.  If they are already in the list, they are both in the IndexReader and committed, so both search and committed are true.
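
The two steps can be paraphrased in a few lines of Python. This is my own sketch of the logic as I described it, not the actual Java in RobinEngine; `segment_flags` and its arguments are hypothetical names, and defaulting `committed` to False in step 1 is my assumption.

```python
def segment_flags(reader_segments, committed_segments):
    """Combine the two segment lists into per-segment flags (steps 1 and 2)."""
    segments = {}
    # Step 1: every segment in the NRT reader is searchable.
    # committed defaults to False until step 2 says otherwise (assumption).
    for name in reader_segments:
        segments[name] = {"search": True, "committed": False}
    # Step 2: every segment in the last commit point is committed.
    for name in committed_segments:
        if name not in segments:
            segments[name] = {"search": False, "committed": True}
        else:
            segments[name]["committed"] = True
    return segments


flags = segment_flags(reader_segments=["_2", "_3"],
                      committed_segments=["_1", "_2"])
for name in sorted(flags):
    print(name, flags[name])
# _1 {'search': False, 'committed': True}
# _2 {'search': True, 'committed': True}
# _3 {'search': True, 'committed': False}
```

Note that {search:true, committed:false} falls out naturally for any segment the reader has picked up since the last commit, so that combination is not an odd corner case.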

I can see why "Searchable" is a misnomer then: what it really refers to is whether the segment is represented in the IndexReader yet.  Committed simply refers to the "Lucene Committed" status.

Thanks for the help!
-Zach



On Wednesday, February 20, 2013 5:50:37 PM UTC-5, simonw wrote:
[...]

Re: Confused about Segments: Searchable vs Committed vs Uncommitted

kimchy
Administrator
Yeah, that's what it means.

On Feb 21, 2013, at 6:08 PM, Zachary Tong <[hidden email]> wrote:

[...]