Finding duplicate documents or its count based on some field names

narinder.izap
Hi All,

I need to know whether Elasticsearch has a feature for finding duplicate documents, or for counting them, where "duplicate" means having the same values in two or more fields. I can do this for a single field using facets, but what if I need to do it across more than one field? For example, suppose I have the following docs in ES:

doc 1:

{
  "name": "abc",
  "age": 22,
  "country": "usa",
  "gender": "male"
}

doc 2:

{
  "name": "xyz",
  "age": 27,
  "country": "usa",
  "gender": "male"
}

doc 3:

{
  "name": "xyz",
  "age": 22,
  "country": "india",
  "gender": "female"
}

doc 4:

{
  "name": "abc",
  "age": 22,
  "country": "usa",
  "gender": "female"
}

My requirement is to find all docs that have the same age and the same country, so doc 1 and doc 4 count as duplicates for me. In simple words, I want something like a unique constraint on a single field or on a composite key made of several fields. Is this possible?

Please let me know if this is possible using Elasticsearch, as it is a very important feature for me.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1988ddb0-9bae-4263-b262-3c84f4445fa8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Re: Finding duplicate documents or its count based on some field names

joergprante@gmail.com
If you can use 1.0.0.Beta2, aggregations might be a solution.

Demo: 


Jörg
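
For readers without the demo at hand, here is a minimal sketch of how the aggregation idea could look, assuming a terms aggregation on "age" with a nested terms aggregation on "country". The names by_outer, by_inner, build_duplicate_agg_body and duplicate_groups are made up for illustration, and the sample response is hand-written rather than captured from a cluster. Filtering on doc_count is done client-side so the sketch does not depend on any version-specific option.

```python
from typing import Any, Dict, List, Tuple


def build_duplicate_agg_body(outer_field: str, inner_field: str) -> Dict[str, Any]:
    """Nest one terms aggregation inside another so that each inner bucket
    counts the documents sharing one (outer, inner) value pair."""
    return {
        "size": 0,  # only bucket counts are needed, not the hits themselves
        "aggs": {
            "by_outer": {
                "terms": {"field": outer_field},
                "aggs": {"by_inner": {"terms": {"field": inner_field}}},
            }
        },
    }


def duplicate_groups(response: Dict[str, Any],
                     min_count: int = 2) -> List[Tuple[Any, Any, int]]:
    """Return (outer_value, inner_value, doc_count) for every value pair
    shared by at least min_count documents."""
    groups = []
    for outer in response["aggregations"]["by_outer"]["buckets"]:
        for inner in outer["by_inner"]["buckets"]:
            if inner["doc_count"] >= min_count:
                groups.append((outer["key"], inner["key"], inner["doc_count"]))
    return groups


# Hand-written response in the shape a nested terms aggregation returns
# for the four example documents (an assumption, not real cluster output).
sample_response = {
    "aggregations": {
        "by_outer": {
            "buckets": [
                {"key": 22, "doc_count": 3, "by_inner": {"buckets": [
                    {"key": "usa", "doc_count": 2},
                    {"key": "india", "doc_count": 1},
                ]}},
                {"key": 27, "doc_count": 1, "by_inner": {"buckets": [
                    {"key": "usa", "doc_count": 1},
                ]}},
            ]
        }
    }
}

body = build_duplicate_agg_body("age", "country")  # request body to POST to _search
print(duplicate_groups(sample_response))  # [(22, 'usa', 2)]
```

The string field ("country" here) probably needs to be non-analyzed, otherwise the bucket keys are the analyzed tokens rather than the original values.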

Re: Finding duplicate documents or its count based on some field names

Ivan Brusic
More Like This could work, especially if using non-analyzed fields:


-- 
Ivan




Re: Finding duplicate documents or its count based on some field names

Yann Barraud
In reply to this post by narinder.izap
Hi,

You can take a look at this:

http://github.com/yannbrrd/elasticsearch-entity-resolution

Re: Finding duplicate documents or its count based on some field names

Alexander Reelsen-2
Hey,

Another very simple solution could be a terms facet with a script field that simply concatenates the two fields you want to check. See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_term_scripts


--Alex
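
A rough sketch of the concatenation idea, assuming the facet syntax of that era ("script_field" inside a terms facet) and a "|" separator that never occurs in the field values. The helper names and the facet name composite_key are hypothetical, and whether a doc['...'].value expression is accepted there should be checked against the linked page. The second helper only emulates the same composite-key counting locally on the four example docs, so the expected result can be seen without a cluster.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List

SEPARATOR = "|"  # assumption: this character never appears in the field values


def build_concat_facet_body(fields: List[str]) -> Dict[str, Any]:
    """Build a terms facet whose script joins the given fields into one term."""
    joiner = " + '%s' + " % SEPARATOR
    script = joiner.join("doc['%s'].value" % f for f in fields)
    return {
        "query": {"match_all": {}},
        "size": 0,  # only the facet counts are of interest
        "facets": {"composite_key": {"terms": {"script_field": script}}},
    }


def count_composite_keys(docs: Iterable[Dict[str, Any]],
                         fields: List[str]) -> Counter:
    """Local emulation: count documents per concatenated key."""
    return Counter(SEPARATOR.join(str(d[f]) for f in fields) for d in docs)


# The four example documents from the original question.
docs = [
    {"name": "abc", "age": 22, "country": "usa", "gender": "male"},
    {"name": "xyz", "age": 27, "country": "usa", "gender": "male"},
    {"name": "xyz", "age": 22, "country": "india", "gender": "female"},
    {"name": "abc", "age": 22, "country": "usa", "gender": "female"},
]

facet_body = build_concat_facet_body(["age", "country"])
counts = count_composite_keys(docs, ["age", "country"])
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {'22|usa': 2}
```

Any term with a count above 1 marks a group of duplicates. Keep in mind that a terms facet only returns the top terms, so the facet size would have to be large enough to cover every key of interest.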

