Difference in analyzer between 1.3.4 and 0.20.2

Difference in analyzer between 1.3.4 and 0.20.2

Ben George
I am in the process of upgrading ES from 0.20.2 to 1.3.4. Below are two requests to test an analyzer / filter, and although the mapping files are semantically the same, the results are slightly different.

Can anyone provide some insight as to why they differ (the start_offset, end_offset and position)? Also, does it matter? The reason I noticed this is that I'm trying to debug some unexpected behaviour with a query where the result set for "a" is the same as for "aa" or even "axxxxxxxxxx".

The filter config is:

                "filter_edge_ngram_front": {
                    "type": "edgeNGram",
                    "max_gram": "20",
                    "min_gram": "1",
                    "side": "front"
                }

v0.20.2/_analyze?text=aa+b&filters=filter_edge_ngram_front&tokenizer=standard

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 1
    },
    {
      "token": "aa",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 2
    },
    {
      "token": "b",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    }
  ]
}

v1.3.4/_analyze?text=aa+b&filters=filter_edge_ngram_front&tokenizer=standard

{
  "tokens": [
    {
      "token": "a",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "aa",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "b",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}
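
To make the difference between the two outputs concrete, here is a minimal Python sketch that reproduces both of them for the text "aa b". It is not the Lucene implementation: whitespace splitting stands in for the standard tokenizer, and the names `edge_ngrams`, `analyze` and the `legacy` flag are hypothetical, chosen only for this illustration.

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Front edge n-grams of a term, e.g. "aa" -> ["a", "aa"]."""
    longest = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, longest + 1)]


def analyze(text, legacy):
    """Emit tokens shaped like the _analyze output shown above.

    legacy=True  mimics the 0.20.2 output: each gram gets its own
                 end_offset and its own position.
    legacy=False mimics the 1.3.4 output: every gram keeps the offsets
                 and the position of the word it was cut from.
    Assumption: words are split on whitespace (a stand-in for the
    standard tokenizer, good enough for simple inputs like "aa b").
    """
    tokens = []
    position = 0
    search_from = 0
    for word in text.split():
        start = text.index(word, search_from)
        end = start + len(word)
        search_from = end
        if not legacy:
            position += 1  # one position per original word
        for gram in edge_ngrams(word):
            if legacy:
                position += 1  # one position per gram
            tokens.append({
                "token": gram,
                "start_offset": start,
                "end_offset": start + len(gram) if legacy else end,
                "type": "word",
                "position": position,
            })
    return tokens


if __name__ == "__main__":
    for label, legacy in (("0.20.2-style", True), ("1.3.4-style", False)):
        print(label)
        for token in analyze("aa b", legacy):
            print("  ", token)
```

In the 1.3.4-style output, "a" and "aa" share position 1 and the offsets 0 to 2 of the source word "aa", which is what the second response above shows.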

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2e58239a-0091-4d8b-872a-e5b5414b72ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Difference in analyzer between 1.3.4 and 0.20.2

simonw-2
We fixed the EdgeNGram tokenizer / filter in the 1.x series, but don't ask me when exactly; I think it was Lucene 4.4 or so. Those offsets are now correct, whereas they were broken before.
Not sure if this helps you debug your problem.
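
As for the query behaviour itself, one possible cause (an assumption, not confirmed anywhere in this thread) is that the query string is run through the same edge n-gram filter at search time. In that case "a", "aa" and "axxxxxxxxxx" all produce the gram "a", so each of them matches every indexed term starting with "a". A rough Python sketch of that effect, with hypothetical helper names `edge_ngrams` and `matches`:

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Front edge n-grams of a term, e.g. "aa" -> ["a", "aa"]."""
    longest = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, longest + 1)]


def matches(query, indexed_term, ngram_query):
    """True if the query shares at least one term with the indexed grams.

    ngram_query=True  assumes the query is analyzed with the edge n-gram
                      filter too (the suspected situation).
    ngram_query=False assumes the query is kept as a single whole term.
    """
    index_terms = set(edge_ngrams(indexed_term))
    query_terms = set(edge_ngrams(query)) if ngram_query else {query}
    return bool(index_terms & query_terms)


if __name__ == "__main__":
    for query in ("a", "aa", "axxxxxxxxxx"):
        print(query,
              matches(query, "apple", ngram_query=True),
              matches(query, "apple", ngram_query=False))
```

If this is what is happening, analyzing the query without the n-gram filter would be the thing to try; whether that applies here depends on the mapping, which is not shown in this thread.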

On Thursday, November 6, 2014 1:31:22 PM UTC+1, Ben George wrote: