html strip question


evvo1961
I am new to Elasticsearch and am trying to work out how to keep HTML tags out of the indexed content.

I have created a small index and added a few items of content. One of the items has its content wrapped in a strong tag, and when I search for the word strong, this item appears in the results. It makes sense for the source to retain its HTML for display, but I don't want this item returned for searches on strong if the only occurrence of the word is the HTML tag.

My basic code is as follows.

Index creation:

PUT /my_index
{
    "settings": {
        "mappings": {
            "post": {
                "properties": {
                    "user_id": {
                        "type": "integer"
                    },
                    "post_text": {
                        "type": "string",
                        "analyzer": "whitespace"
                    },
                    "post_date": {
                        "type": "date"
                    },
                    "post_word_count": {
                        "type": "integer"
                    }
                }
            }
        },
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":     "mapping",
                    "mappings": [ "&=> and " ]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":      "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "my_stopwords" ]
                }
            }
        }
    }
}

I then add a few items, which display in a Marvel search as:

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 1,
      "hits": [
         {
            "_index": "my_index",
            "_type": "post",
            "_id": "AU-x2iYGSPpIjML6_aFi",
            "_score": 1,
            "_source": {
               "userId": 1,
               "postDate": "2015-09-09T12:25:07.9745609+01:00",
               "postText": "This is a blog post."
            }
         },
         {
            "_index": "my_index",
            "_type": "post",
            "_id": "AU-x2iYySPpIjML6_aFj",
            "_score": 1,
            "_source": {
               "userId": 1,
               "postDate": "2015-09-09T12:25:07.9745609+01:00",
               "postText": "This is another blog post."
            }
         },
         {
            "_index": "my_index",
            "_type": "post",
            "_id": "AU-x2iY0SPpIjML6_aFk",
            "_score": 1,
            "_source": {
               "userId": 2,
               "postDate": "2015-09-09T12:25:07.9745609+01:00",
               "postText": "This is a third blog post."
            }
         },
         {
            "_index": "my_index",
            "_type": "post",
            "_id": "AU-x2iY1SPpIjML6_aFl",
            "_score": 1,
            "_source": {
               "userId": 2,
               "postDate": "2015-09-14T12:25:07.9745609+01:00",
               "postText": "<strong>This is a blog post from the future.</strong>"
            }
         }
      ]
   }
}

I would appreciate any advice on how to achieve this, so that the fourth item is not returned in results for the word strong.
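
One possible direction, offered as a sketch rather than a confirmed fix: in the index creation request, "mappings" sits inside "settings", and the mapped field is post_text (with the whitespace analyzer), while the indexed documents use postText. So my_analyzer, the only analyzer that includes html_strip, is probably never applied to the text being searched. The request below moves "mappings" to the top level and assigns my_analyzer to postText. It assumes an Elasticsearch 1.x cluster (because of the "string" type), takes the field names from the _source shown in the search results, and assumes the index is deleted, recreated, and the documents reindexed, since the analyzer of an already indexed field cannot be changed in place.

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":     "mapping",
                    "mappings": [ "&=> and " ]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type":      "stop",
                    "stopwords": [ "the", "a" ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip", "&_to_and" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase", "my_stopwords" ]
                }
            }
        }
    },
    "mappings": {
        "post": {
            "properties": {
                "userId":   { "type": "integer" },
                "postText": { "type": "string", "analyzer": "my_analyzer" },
                "postDate": { "type": "date" }
            }
        }
    }
}
```

To check what the analyzer produces, the analyze API can be called with the raw text as the request body (this form is from the 1.x API; later versions may expect a JSON body instead). If html_strip is working, no strong token should appear in the response:

```
GET /my_index/_analyze?analyzer=my_analyzer
<strong>This is a blog post from the future.</strong>
```

A search that does not name a field usually runs against the _all field, which is analyzed separately, so the check is most meaningful when the query targets postText directly, for example with a match query on that field.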