Filtering for apostrophes and single quotes confusion?

Filtering for apostrophes and single quotes confusion?

phill
Recently a customer of ours confused both himself and us when he discovered that his corpus of documents contains at least two different characters used for the apostrophe.
I didn't recognize the issue at first.

For those who aren't familiar with this problem: depending on a document's editing history, the apostrophe used to mark a possessive in English, e.g.
"customer's report"  (one report from one customer), may be any one of several characters, and apparently the 3.x StandardAnalyzer doesn't take this into consideration.
Did I miss a mention of a fix for this?  I was testing against Lucene 3.4, and I didn't see one mentioned for ES either (I'm not running 4.x yet).

While I found some discussion of this issue over the years, I was surprised to not find any general solution either in Standard Analyzer or in some extra Filter that I might leverage in a filter chain. Am I missing something? 

I should also say that even fairly small test document sets, gathered as ordinary domain examples from the web, have now been shown to include different apostrophe characters.

My suggested solution matches the one line from the Snowball page (see below): "Clearly other codes for apostrophe can be mapped to this [apostrophe] code prior to stemming."  I definitely do NOT want to mess with things too much.  I already don't use the standard analyzer (so as not to bother dropping stopwords), so I was thinking that a simple filter, chained before the standard filter, that looks for odd apostrophes followed by s and replaces the odd character with U+0027 would do the trick.
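
A minimal sketch of the "odd apostrophe followed by s" idea, written in Python rather than as a Lucene filter. The function name and the set of "odd" characters are assumptions made for illustration; a real filter would live in the analysis chain and would need its own list.

```python
import re

# Assumed set of apostrophe look-alikes to normalize (illustrative, not exhaustive).
ODD_APOSTROPHES = "\u2018\u2019\uff07"

# An odd apostrophe that follows a word character and precedes a word-final "s" or "S".
_POSSESSIVE = re.compile(r"(?<=\w)[" + ODD_APOSTROPHES + r"](?=[sS]\b)")


def normalize_possessives(text: str) -> str:
    """Replace odd apostrophes with U+0027, but only in the pattern xxx's."""
    return _POSSESSIVE.sub("\u0027", text)


print(normalize_possessives("customer\u2019s report"))
```

Quotation marks around a phrase and apostrophes inside names such as O’Reilly are left alone by this pattern, which is the "don't mess with things too much" property described above.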

Any thoughts or help?

-Paul

*****************
All the background information I have on the topic.

Smart editors (MS Word and, I believe, Adobe Acrobat) convert the character typed with the ordinary single quote/apostrophe key (on a modern English MS keyboard, the un-shifted character on the key shared with the double quote) to various other characters.  Meanwhile, neither browser text-entry boxes (by default) nor simpler text editors alter the characters typed, so the result is usually a plain APOSTROPHE.
Paul’s example of an apostrophe and a ‘full quote’ generated by the Outlook editor.

Paul's example of an apostrophe and a 'full quote' typed into a browser field.

Assuming my e-mail editor, my e-mailer, the list mailer, your e-mail program and your viewer all preserved the characters along the way, the 1st line uses 3 different characters and the 2nd uses 1.  I typed the same single key in every case.
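
One way to check what actually arrived is to list the code points in each line. The sketch below uses Python's standard unicodedata module; the function name and the two input strings are hypothetical, modeled on the example lines above, and what it reports depends entirely on which characters survived the trip to your copy of the text.

```python
import unicodedata
from collections import Counter


def apostrophe_report(text: str) -> Counter:
    """Count U+0027 and every non-ASCII character, labelled by code point and name."""
    counts = Counter()
    for ch in text:
        if ch == "\u0027" or ord(ch) > 0x7F:
            name = unicodedata.name(ch, "<no name>")
            counts[f"U+{ord(ch):04X} {name}"] += 1
    return counts


# Hypothetical inputs modeled on the two example lines above.
from_editor = "Paul\u2019s example of an apostrophe and a \u2018full quote\u2019"
from_browser = "Paul's example of an apostrophe and a 'full quote'"

for label, line in (("editor", from_editor), ("browser", from_browser)):
    print(label, dict(apostrophe_report(line)))
```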

Various characters that might show up include the following:
U+0027    APOSTROPHE                             The original ASCII character; _probably_ what your keyboard sends, but I can't promise anything.
U+0091    Left single quotation mark             Not ASCII: this is byte 0x91 in Windows-1252, which the table in Note 1 presents under ISO 8859-1 / ISO Latin 1. In Unicode proper, U+0091 is a control character (PRIVATE USE ONE).
U+0092    Right single quotation mark            Not ASCII: this is byte 0x92 in Windows-1252, presented the same way in Note 1. In Unicode proper, U+0092 is a control character (PRIVATE USE TWO).
U+2018    LEFT SINGLE QUOTATION MARK             The official Unicode character.  This is what I get from the above example generated in 2013.
U+2019    RIGHT SINGLE QUOTATION MARK            The official Unicode character.  This is what I get from the above example generated in 2013.
U+201B    SINGLE HIGH-REVERSED-9 QUOTATION MARK  Mentioned as a special-case use at Tartarus.org in other contexts, e.g. O'Reilly (see link below).
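
As far as I can tell, U+0091 and U+0092 usually reach a corpus when Windows-1252 bytes are decoded as Latin-1. A short Python sketch of the difference follows; the byte string is a hypothetical input and describe is an illustrative helper, not part of any library.

```python
import unicodedata

# Hypothetical bytes as a Windows editor might save "customer's report".
raw = b"customer\x92s report"

as_cp1252 = raw.decode("cp1252")    # 0x92 becomes U+2019
as_latin1 = raw.decode("latin-1")   # 0x92 becomes U+0092


def describe(ch: str) -> str:
    """Code point, general category, and name (if any) of one character."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch, '<no name>')}"


print(describe(as_cp1252[8]))
print(describe(as_latin1[8]))
```

Decoding with the right codec at ingestion time avoids the control characters altogether; mapping U+0091 and U+0092 afterwards is the fallback when the bytes are no longer available.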

Standards are just crazy things in the real world since they are never followed fully.
Typing a single quote from the keyboard into the website http://www.babelstone.co.uk/unicode/whatisit.html
using either Firefox or IE reports back that it got U+0027, the old-fashioned apostrophe, but the Unicode page for U+2019 says [U+2019] "is the preferred character to use for apostrophe".

The Snowball parser folks spotted the problem and summarized it at:
http://snowball.tartarus.org/texts/apostrophe.html
I didn't see any filters there either, but maybe I didn't search well enough, or maybe I used the wrong apostrophe when searching.

-Paul

(1) ISO 8859-1 ISO Latin 1  http://www.ascii-code.com/






--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: Filtering for apostrophes and single quotes confusion?

Clinton Gormley-2
Hi Paul

Interesting...

From this:

> For these reasons, the English stemmer treats apostrophe as a letter, removing it from the beginning of a word, where it might have stood for an opening quote, from the end of the word, where it might have stood for a closing quote, or been an apostrophe following s. The form ’s is also treated as an ending. 

... it sounds like things will work correctly as long as you normalize all single quotes/apostrophes to the same character, which you can do with a char filter:



curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "quotes" : {
               "filter" : [
                  "standard",
                  "lowercase"
               ],
               "char_filter" : [
                  "quotes"
               ],
               "tokenizer" : "standard"
            }
         },
         "char_filter" : {
            "quotes" : {
               "mappings" : [
                  "\\u0091=>\u0027",
                  "\\u0092=>\u0027",
                  "\\u2018=>\u0027",
                  "\\u2019=>\u0027"
               ],
               "type" : "mapping"
            }
         }
      }
   }
}
'

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty&analyzer=quotes' -d '
Paul’s example of an apostrophe and a ‘full quote’ generated by the Outlook editor.
'

# {
#    "tokens" : [
#       {
#          "end_offset" : 6,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "<ALPHANUM>",
#          "token" : "paul's"
#       },
#       {
#          "end_offset" : 14,
#          "position" : 2,
#          "start_offset" : 7,
#          "type" : "<ALPHANUM>",
#          "token" : "example"
#       },
#       {
#          "end_offset" : 17,
#          "position" : 3,
#          "start_offset" : 15,
#          "type" : "<ALPHANUM>",
#          "token" : "of"
#       },
#       {
#          "end_offset" : 20,
#          "position" : 4,
#          "start_offset" : 18,
#          "type" : "<ALPHANUM>",
#          "token" : "an"
#       },
#       {
#          "end_offset" : 31,
#          "position" : 5,
#          "start_offset" : 21,
#          "type" : "<ALPHANUM>",
#          "token" : "apostrophe"
#       },
#       {
#          "end_offset" : 35,
#          "position" : 6,
#          "start_offset" : 32,
#          "type" : "<ALPHANUM>",
#          "token" : "and"
#       },
#       {
#          "end_offset" : 37,
#          "position" : 7,
#          "start_offset" : 36,
#          "type" : "<ALPHANUM>",
#          "token" : "a"
#       },
#       {
#          "end_offset" : 43,
#          "position" : 8,
#          "start_offset" : 39,
#          "type" : "<ALPHANUM>",
#          "token" : "full"
#       },
#       {
#          "end_offset" : 49,
#          "position" : 9,
#          "start_offset" : 44,
#          "type" : "<ALPHANUM>",
#          "token" : "quote"
#       },
#       {
#          "end_offset" : 60,
#          "position" : 10,
#          "start_offset" : 51,
#          "type" : "<ALPHANUM>",
#          "token" : "generated"
#       },
#       {
#          "end_offset" : 63,
#          "position" : 11,
#          "start_offset" : 61,
#          "type" : "<ALPHANUM>",
#          "token" : "by"
#       },
#       {
#          "end_offset" : 67,
#          "position" : 12,
#          "start_offset" : 64,
#          "type" : "<ALPHANUM>",
#          "token" : "the"
#       },
#       {
#          "end_offset" : 75,
#          "position" : 13,
#          "start_offset" : 68,
#          "type" : "<ALPHANUM>",
#          "token" : "outlook"
#       },
#       {
#          "end_offset" : 82,
#          "position" : 14,
#          "start_offset" : 76,
#          "type" : "<ALPHANUM>",
#          "token" : "editor"
#       }
#    ]
# }
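
For anyone who wants to try the idea without a running node, here is a rough Python stand-in for the chain above (char mapping, then tokenizing, then lowercasing). It is only a sketch: the regex is an assumed approximation of the standard tokenizer, and the function names are made up for illustration.

```python
import re

# Same four mappings as the "quotes" char filter above.
QUOTE_MAP = str.maketrans({
    "\u0091": "'",
    "\u0092": "'",
    "\u2018": "'",
    "\u2019": "'",
})

# Runs of letters/digits, optionally joined by an apostrophe inside the word.
_TOKEN = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def normalize_quotes(text: str) -> str:
    """Map the odd single quotes to U+0027, like the mapping char filter."""
    return text.translate(QUOTE_MAP)


def analyze(text: str) -> list:
    """Very rough approximation of char filter + tokenizer + lowercase."""
    return _TOKEN.findall(normalize_quotes(text).lower())


sample = "Paul\u2019s example of an apostrophe and a \u2018full quote\u2019 generated by the Outlook editor."
print(analyze(sample))
```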

Re: Filtering for apostrophes and single quotes confusion?

phill
On 5/24/2013 3:22 AM, Clinton Gormley wrote:
> Hi Paul
>
> Interesting...
>
> From this:
>
> > For these reasons, the English stemmer treats apostrophe as a letter, removing it from the beginning of a word, where it might have stood for an opening quote, from the end of the word, where it might have stood for a closing quote, or been an apostrophe following s. The form ’s is also treated as an ending.
>
> ... it sounds like things will work correctly as long as you normalize all single quotes/apostrophes to the same character, which you can do with a char filter:

Thanks for the response.  You are right: the Snowball parser will be happy with simple character replacement, so there's no need to try to identify only "xxx's" occurrences using a custom _token_ filter. I'm a little nervous that I might throw off some other filter, but my particular configuration is all under my control, so all is good.

Just to complete the record: while looking at the Lucene code, I spotted that the simple EnglishPossessiveFilter also thinks it's worth looking for one more character, U+FF07.  In Unicode this is called FULLWIDTH APOSTROPHE, which seems a more likely candidate than the last entry in my original list, U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK (gosh, what a name! U+201B is its correct code point, as on the linked Snowball page). That one does appear in some actual non-technical published documents, as mentioned on the Snowball page, but not as a possessive or a quote.
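
To make that behaviour concrete, here is a small Python sketch of possessive stripping as I understand it from the description above. It is not the Lucene source: the function name is made up, and the exact set of apostrophes the real filter accepts should be checked against the Lucene version in use.

```python
# Apostrophes assumed to mark a possessive: U+0027, U+2019 and U+FF07.
POSSESSIVE_APOSTROPHES = {"\u0027", "\u2019", "\uff07"}


def strip_possessive(token: str) -> str:
    """Drop a trailing apostrophe + s (or S) from a single token."""
    if len(token) >= 2 and token[-1] in "sS" and token[-2] in POSSESSIVE_APOSTROPHES:
        return token[:-2]
    return token


for tok in ("paul's", "paul\u2019s", "paul\uff07s", "paul\u201bs", "its"):
    print(repr(tok), "->", repr(strip_possessive(tok)))
```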

I'd suggest your example char filter should get one more entry for this fat or full-width apostrophe.

         "char_filter" : {
            "quotes" : {
               "mappings" : [
                  "\\u0091=>\u0027",
                  "\\u0092=>\u0027",
                  "\\u2018=>\u0027",
                  "\\u2019=>\u0027",
                  "\\uFF07=>\u0027"
               ],
               "type" : "mapping"
            }
         }
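
Hand-edited JSON lists are easy to get subtly wrong, so one option is to generate the mapping entries. The Python sketch below builds the same fragment from a list of code points; the helper name is illustrative, and it assumes the mapping char filter accepts \uXXXX escapes on both sides of "=>", as it does on the left-hand side in the example above.

```python
import json

SOURCES = [0x0091, 0x0092, 0x2018, 0x2019, 0xFF07]  # characters to normalize
TARGET = 0x0027                                      # plain APOSTROPHE


def mapping_rule(src: int, dst: int = TARGET) -> str:
    """One rule with literal backslash-u escapes, left for the char filter to parse."""
    return f"\\u{src:04X}=>\\u{dst:04X}"


char_filter = {
    "quotes": {
        "type": "mapping",
        "mappings": [mapping_rule(cp) for cp in SOURCES],
    }
}

# json.dumps doubles each backslash, so the escapes survive JSON parsing intact.
print(json.dumps({"char_filter": char_filter}, indent=3))
```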


I'd leave out the myriad other characters that look like apostrophes; they all seem to be special linguistic marks, which I hope either remain part of the term for any linguistic processing or get filtered away later when no one cares.

-Paul

p.s.   The fully correct way to spell Hawaii uses one of those really special characters -- Hawaiʻi. see
http://www.fileformat.info/info/unicode/char/02BB/index.htm
"used in Hawai`ian orthography as `okina (glottal stop)"  (but that last sentence used what many folks use for glottal stop - a grave accent).
Now you too can form sentences of the form "Hawaiʻi's language orthography has its own special characters."
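
Unicode's own classification backs up leaving marks like the ʻokina alone: it is a letter, while the quote characters are punctuation. A quick check with Python's standard unicodedata module (the candidate list and helper name are illustrative choices):

```python
import unicodedata

# U+0027, U+2018, U+2019, U+FF07 from the mapping; U+02BB is the okina in Hawaiʻi.
CANDIDATES = ["\u0027", "\u2018", "\u2019", "\uff07", "\u02bb"]


def is_letter(ch: str) -> bool:
    """True when the Unicode general category is one of the letter classes (L*)."""
    return unicodedata.category(ch).startswith("L")


for ch in CANDIDATES:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch), is_letter(ch))
```

A normalizer that only touches punctuation-class look-alikes would therefore keep Hawaiʻi intact while still fixing customer’s.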
