Documents indexed, but cannot 'GET' them.

4 messages

Documents indexed, but cannot 'GET' them.

mp2893

Hi,

I have an index named 'news' and a mapping (type) named 'document'.
'news' has 5 shards and 2 replicas.
Below is the mapping of 'document':
---------------------------------------------------------------------------------
{
  "document" : {
    "_parent" : {
      "type" : "cluster"
    },
    "_routing" : {
      "required" : true
    },
    "_source" : {
      "enabled" : false
    },
    "properties" : {
      "clusterid" : {
        "type" : "string",
        "store" : "yes"
      },
      "company" : {
        "type" : "string"
      },
      "companyNum" : {
        "type" : "string"
      },
      "count" : {
        "type" : "integer"
      },
      "date" : {
        "type" : "date",
        "store" : "yes",
        "format" : "YYYY-MM-dd"
      },
      "text" : {
        "type" : "string",
        "analyzer" : "snowball",
        "term_vector" : "with_positions_offsets"
      },
      "title" : {
        "type" : "string",
        "boost" : 2.0,
        "analyzer" : "snowball",
        "store" : "yes",
        "term_vector" : "with_positions_offsets"
      },
      "url" : {
        "type" : "string"
      }
    }
  }
}
-------------------------------------------------------------------------------------
Don't mind the '_parent'. I don't think that is relevant.

I bulk-indexed 427410 JSON documents and got no errors; everything went fine.
Below is the bulk-indexing code (partial code, actually):
---------------------------------------------------------------------------------
    public void insertDocumentBulk(ArrayList<String> lineList, String documentMapping) throws Exception
    {
        BulkRequestBuilder brb = client.prepareBulk();

        for (String line : lineList)
        {
            JSONObject jobj = new JSONObject(line);
            String id = jobj.getString("docid");
            String clusterId = jobj.getString("clusterid");
            String dateFormat = convertDateFormat(jobj.getInt("date"));
            JSONArray sentArray = jobj.getJSONArray("text");
            String text = jsonArrayToString(sentArray);
            jobj.put("text", text);
            jobj.put("date", dateFormat);
            jobj.remove("docid");
            brb.add(client.prepareIndex(index, documentMapping, id)
                    .setParent(clusterId)
                    .setSource(jobj.toString()));
        }

        BulkResponse bulkResponse = brb.execute().actionGet();

        // hasFailures() only reports whether at least one item failed,
        // so count the individual item failures instead.
        int count = 0;
        for (BulkItemResponse item : bulkResponse.items())
        {
            if (item.failed())
            {
                count++;
            }
        }
        System.out.println("error count: " + count);
    }
---------------------------------------------------------------------------------
But when I try to 'GET' some of the documents, I get nothing.
For example, if I run
curl -XGET 'http://etridorm.iptime.org:9200/news/document/20110803_0_96464759519569618?fields=title'
I get
{"_index":"news","_type":"document","_id":"20110803_0_96464759519569618","exists":false}

It's not that I can't 'GET' any of the documents: some of them are
accessible, but some aren't.
The funny thing is, when I run
curl -XGET 'http://etridorm.iptime.org:9200/news/document/_count?q=*'
I get
{"count":427410,"_shards":{"total":5,"successful":5,"failed":0}}
which is exactly the number of documents I indexed.

I did a flush (curl -XPOST 'http://etridorm.iptime.org:9200/news/document/_flush')
and a refresh (curl -XPOST 'http://etridorm.iptime.org:9200/news/document/_refresh'),
but neither improved the situation.

Am I doing something wrong here?
I'd appreciate any help.

Ed
Re: Documents indexed, but cannot 'GET' them.

mp2893

I've found the answer to my problem (as is often the case).
After you index a document with '_parent' set, you need to specify the 'routing' parameter when you 'GET' it.
For example, if you index a document with id '1234' and parent '5678', the 'GET' command should be:
curl -XGET 'http://etridorm.iptime.org:9200/news/document/1234?routing=5678'
Hope this helps someone like me.
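A minimal sketch of the URL form involved (it only assembles the string and does not contact Elasticsearch; the host and ids are the ones from the examples in this thread):

```java
// Assembles the GET URL for a child document. With a parent/child
// mapping, the routing value (the parent id) must be passed as a
// query parameter, or the GET may look on the wrong shard.
public class GetUrlSketch {
    static String getUrl(String host, String index, String type,
                         String id, String parentId) {
        return "http://" + host + "/" + index + "/" + type + "/" + id
                + "?routing=" + parentId;
    }

    public static void main(String[] args) {
        // ids from the example above
        System.out.println(getUrl("etridorm.iptime.org:9200",
                "news", "document", "1234", "5678"));
        // prints http://etridorm.iptime.org:9200/news/document/1234?routing=5678
    }
}
```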


Re: Documents indexed, but cannot 'GET' them.

kimchy
Administrator
Yeah, that's because the child document is routed by its parent id, so parent and child end up on the same shard.



Re: Documents indexed, but cannot 'GET' them.

Allan Johns
So I can't 'GET' a document with a known ID unless I also know its parent's ID?

This really throws a spanner in the works for me. I started getting the same problem (random GETs failing) when I added parent/child documents to my database, but I was relying on being able to fetch a document knowing only its ID. Is there a workaround?

thx
A





--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.