Filtering for apostrophes and single quotes confusion?
Recently a customer of ours confused himself and us when he
discovered his corpus of documents contains (at least) two different
characters used for the apostrophe.
I didn't recognize the issue at first.
For those who aren't familiar with this problem: depending on the
history of characters in a document, an apostrophe used as a
possessive in English, e.g.
"customer's report" (one report from one customer), may be one of
several characters, and apparently the 3.x StandardAnalyzer doesn't
take this into consideration.
Did I miss a mention of a fix for this? I was conducting tests
against Lucene 3.4, but didn't see mention in ES either (I'm not
running 4.x yet).
While I found some discussion of this issue over the years, I was
surprised to not find any general solution either in Standard
Analyzer or in some extra Filter that I might leverage in a filter
chain. Am I missing something?
I also have to say that even fairly small test document sets, gathered
as ordinary domain examples from the web, have now been shown to include
different apostrophe characters.
My suggested solution matches the one line from the Snowball page
(see below): "Clearly other codes
for apostrophe can be mapped to this [apostrophe] code prior to
stemming." I'm sure I DO NOT want to mess with things too much. I
already don't use the standard analyzer (so as not to bother dropping
stopwords), so I was thinking that a simple filter, chainable before
the standard filter, that looks for odd apostrophes followed by s and
replaces the odd character with U+0027
would do the trick.
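
To make the idea concrete, here is a minimal sketch in plain Python (not a Lucene filter). The set of "odd" apostrophes and the name normalize_possessive are assumptions for illustration only; a real filter would do the same replacement inside the analysis chain.

```python
import re

# Assumption: the "odd" apostrophes to rewrite; this set is illustrative,
# not an exhaustive or official list.
ODD_APOSTROPHES = "\u0091\u0092\u2018\u2019"

# Match an odd apostrophe only when it is followed by s/S at the end of a word.
_POSSESSIVE = re.compile("[" + ODD_APOSTROPHES + r"](?=[sS]\b)")

def normalize_possessive(text: str) -> str:
    """Replace an odd apostrophe before a word-final 's' with U+0027."""
    return _POSSESSIVE.sub("\u0027", text)
```

With this, "customer\u2019s report" would come out as "customer's report", while curly quotes that are not followed by a word-final s are left alone.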
Any thoughts or help?
Here is all the background information I have on the topic.
Smart editors (MS Word and, I believe, Adobe Acrobat) convert the
ordinary single quote/apostrophe key (on the modern English MS
keyboard, that is the un-shifted key on the double-quote and
single-quote (?) key) to various other characters. Meanwhile,
neither browser web page entry boxes (by default) nor simpler text
editors mess with characters typed usually resulting in an
of an apostrophe and a ‘full quote’ generated by the Outlook
of an apostrophe and a 'full quote' typed into a browser field.
Assuming my e-mail editor, my e-mailer, the list mailer,
your e-mail program and your viewer all preserved the characters
along the way, the 1st line uses 3 different characters and the 2nd uses
1. I only typed one character in all cases.
Various characters that might show up include the following:

U+0027 APOSTROPHE: Original ASCII character, _probably_ what your keyboard sends, but I can't promise anything.
U+0091 Left single quotation mark: ASCII ISO 8859-1 ISO Latin 1 (Note 1), but is listed as PRIVATE USE ONLY in official Unicode.
U+0092 Right single quotation mark: ASCII ISO 8859-1 ISO Latin 1 (Note 1), but is listed as PRIVATE USE ONLY in official Unicode.
U+2018 LEFT SINGLE QUOTATION MARK: The official Unicode character. This is what I get from the above example generated in 2013.
U+2019 RIGHT SINGLE QUOTATION MARK: The official Unicode character. This is what I get from the above example generated in
SINGLE HIGH-REVERSED-9 QUOTATION MARK: Mentioned as a special case use at Tartarus.org in other contexts, e.g. O'Reilly (see link below).
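
For anyone who wants to check their own documents, a rough Python sketch like the following would report which of these characters actually occur. The candidate set and the function names are assumptions made for illustration (U+201B is the code point Unicode assigns to SINGLE HIGH-REVERSED-9 QUOTATION MARK).

```python
from collections import Counter
import unicodedata

# Assumption: candidate code points taken from the list above.
CANDIDATES = "\u0027\u0091\u0092\u2018\u2019\u201B"

def count_apostrophes(text: str) -> Counter:
    """Count each apostrophe-like character that appears in the text."""
    return Counter(ch for ch in text if ch in CANDIDATES)

def describe(counts: Counter) -> list:
    """Format the counts as 'U+XXXX NAME: n' lines, ordered by code point."""
    lines = []
    for ch, n in sorted(counts.items()):
        # Control characters such as U+0091 have no Unicode name.
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U+{ord(ch):04X} {name}: {n}")
    return lines
```

Running describe(count_apostrophes(text)) over a sample of the corpus gives a quick picture of how mixed the apostrophes really are before deciding what to normalize.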
Standards are just crazy things in the real world since they are
never followed fully.
Typing a single quote from the keyboard into the website
using either Firefox or IE reports back that it got U+0027, the
old-fashioned apostrophe, but Unicode at the page for U+2019
says [U+2019] "is the preferred character to use for apostrophe".
The Snowball parser folks spotted the problem and summarized it at: http://snowball.tartarus.org/texts/apostrophe.html
But I didn't see any Filters there either, but maybe I didn't search
well enough, but then maybe I used the wrong apostrophe when
Re: Filtering for apostrophes and single quotes confusion?
> For these reasons, the English stemmer treats apostrophe as a letter, removing it from the beginning of a word, where it might have stood for an opening quote, from the end of the word, where it might have stood for a closing quote, or been an apostrophe following s. The form ’s is also treated as an ending.
... it sounds like things will work correctly as long as you normalize all single quotes/apostrophes to the same character, which you can do with a char filter:
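
As a rough illustration of that kind of normalization (a plain Python sketch, not the actual Elasticsearch char filter configuration), every variant is mapped to U+0027 before any tokenizing or stemming happens. The table contents and the name normalize_quotes are assumptions for illustration.

```python
# Assumption: the variants to fold into U+0027 are the ones named in this thread.
APOSTROPHE_MAP = str.maketrans({
    "\u0091": "'",
    "\u0092": "'",
    "\u2018": "'",
    "\u2019": "'",
})

def normalize_quotes(text: str) -> str:
    """Map every listed single-quote variant to the ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)
```

Because each replacement is one character for one character, the text length and character offsets stay the same, which matters if anything downstream relies on offsets (highlighting, for example).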
Re: Filtering for apostrophes and single quotes confusion?
On 5/24/2013 3:22 AM, Clinton Gormley wrote:
> > For these reasons, the English stemmer treats apostrophe as
> > a letter, removing it from the beginning of a word, where it
> > might have stood for an opening quote, from the end of the
> > word, where it might have stood for a closing quote, or been
> > an apostrophe following s. The form ’s is
> > also treated as an ending.
> ... it sounds like things will work correctly as long as you
> normalize all single quotes/apostrophes to the same character,
> which you can do with a char filter:
Thanks for the response. You are right, the Snowball parser will
be happy with just simple character replacement, so there's no need
to try to identify only "xxx's" occurrences using a custom _token_
filter. I'm a little nervous that I'd throw off some other Filter,
but my particular configuration is all under my control, so all is
Just to complete the record, while looking at the Lucene code, I did
spot that the simple EnglishPossessiveFilter also thinks it's worth
looking for one more: U+FF07. In Unicode this is called FULLWIDTH
APOSTROPHE, which seems more likely than the one I listed in my
original list (note: the URL link was right, the URL text was wrong),
U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK (Gosh, what a name!), even
if that is used in some actual non-technical published documents as
mentioned on the Snowball page, but not as a possessive or a quote.
I'd suggest your example char filter should get one more entry for
this fat or full-width apostrophe.
I'd leave out all the myriad other characters that look like
apostrophes, which all seem to be special linguistic marks that I
hope remain part of the term for any linguistic processing, or maybe
get further filtered away when no one cares.
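
To show the distinction I mean, here is a small Python sketch of possessive stripping that treats only U+0027, U+2019 and U+FF07 as apostrophes and leaves the linguistic marks alone. It is only my illustration of the idea, not Lucene's code, and the name strip_possessive is made up.

```python
# Assumption: only these three characters count as a possessive apostrophe.
APOSTROPHES = ("\u0027", "\u2019", "\uFF07")

def strip_possessive(token: str) -> str:
    """Drop a trailing apostrophe + s/S from a token; leave other tokens alone."""
    if len(token) >= 2 and token[-1] in "sS" and token[-2] in APOSTROPHES:
        return token[:-2]
    return token
```

A token ending in U+FF07 followed by s loses its possessive, while the U+02BB inside a word is never touched because it is not in the set.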
p.s. The fully correct way to spell Hawaii uses one of those
really special characters -- Hawaiʻi. see http://www.fileformat.info/info/unicode/char/02BB/index.htm
"used in Hawai`ian orthography as `okina (glottal stop)" (but that
last sentence used what many folks use for glottal stop - a grave
Now you too can form sentences of the form "Hawaiʻi's language
orthography has it's own special characters."