Attachment Plugin Questions on Storing

Mike Gaffney
I'm trying to make use of the attachments plugin. I've got the
following Mapping:

{
  "docs": {
    "properties": {
      "contents": {
        "type": "attachment",
        "fields": {
          "contents": {"store": "no"}
        }
      },
      "lastModified": {"type": "long", "index": "analyzed", "store": "no"}
    }
  }
}

And the following index code:

    XContentBuilder objectBuilder = jsonBuilder().startObject();

    objectBuilder.startObject(Index.CONTENTS);
    if (extension.equals("xml")) {
        objectBuilder.field("_content_type", MimeTypes.XML);
    } else {
        objectBuilder.field("_content_type", MimeTypes.PLAIN_TEXT);
    }
    objectBuilder.field("_name", file.getName());
    objectBuilder.field("content", Base64.encodeBase64(FileUtils.readFileToString(file).getBytes()));
    objectBuilder.endObject();

    objectBuilder.field(Index.LAST_MODIFIED, file.lastModified());
    objectBuilder.endObject();
    IndexRequestBuilder setSource = client.prepareIndex(Index.INDEX, Index.TYPE, file.getAbsolutePath()).setSource(objectBuilder);
    setSource.execute().actionGet();

But when I look at the mapping on the server I see:
{
   doc: {
       properties: {
           lastModified: {
               index: "analyzed"
               type: "long"
           }
           contents: {
               path: "full"
               type: "attachment"
               fields: {
                   author: {
                       type: "string"
                   }
                   title: {
                       type: "string"
                   }
                   keywords: {
                       type: "string"
                   }
                   contents: {
                       type: "string"
                   }
                   date: {
                       format: "dateOptionalTime"
                       type: "date"
                   }
                   content_type: {
                       type: "string"
                   }
               }
           }
       }
   }
}

Basically, I don't really want to store the contents, just index the
documents and be able to search on them. I'm indexing files that are
on the computer already so I don't need the contents, and in fact it's
taking up a ton of space to have the contents in there.

Another question: the contents seem to be just the Base64 string. Is that
correct, or am I doing something wrong?
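One thing that may be worth checking here is how the Base64 string is produced. The sketch below is not the code from this thread: it uses the JDK's java.util.Base64 instead of the Base64 helper in the post, and the class and method names are made up for illustration. It encodes the raw file bytes exactly once and shows that encoding the already-encoded text a second time yields a different string. Whether the builder used above encodes byte arrays again on its own is an assumption to verify against the client version in use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the attachments plugin or the original code.
class AttachmentEncoding {

    // Encode the raw file bytes exactly once; the result is plain ASCII text
    // that can be placed in a JSON string field.
    static String encodeOnce(byte[] rawFileBytes) {
        return Base64.getEncoder().encodeToString(rawFileBytes);
    }

    // Decode a Base64 string back to the original bytes.
    static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        byte[] raw = "hello attachment".getBytes(StandardCharsets.UTF_8);

        String once = encodeOnce(raw);
        // Encoding the encoded text again produces a longer, different string.
        // A consumer that decodes only once would then see Base64 text
        // instead of the document contents.
        String twice = encodeOnce(once.getBytes(StandardCharsets.US_ASCII));

        System.out.println("encoded once:  " + once);
        System.out.println("encoded twice: " + twice);
        System.out.println("round trip:    " + new String(decode(once), StandardCharsets.UTF_8));
    }
}
```

Working from raw bytes (rather than reading the file into a String first) also avoids charset conversion changing the bytes of non-text files.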

I'm using this as a local machine file search mechanism for a large
art / document tree that each user has locally on their machines.

My results look like this (sorry for the redactions, it's proprietary info):

{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1663,"max_score":1.0,"hits":[{"_index":"docs","_type":"doc","_id":"the_art_redacted","_score":1.0,"fields":{"contents":{"content":"REALLY_LONG_BASE64_STRING","_name":"the_file_name_redacted","_content_type":"text/plain"}}}]}}

Any additional explanation of attachments will be quite helpful.

Thanks,
 Mike

Re: Attachment Plugin Questions on Storing

Paul Loy
Can you have a look in your logs for statements about mapping changes? It may be that you don't have everything specified in your mapping, so it's getting overridden by dynamic mappings.

You should see update mapping messages in your logs.

--
---------------------------------------------
Paul Loy
[hidden email]
http://uk.linkedin.com/in/paulloy

Re: Attachment Plugin Questions on Storing

Mike Gaffney
At debug level, all I see related to mapping is:

20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.mapper  - [Ani-Mator] [docs] using dynamic[true], default mapping: location[null] and source[{
    "_default_" : {
    }
}]
20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.cache.field.data.resident  - [Ani-Mator] [docs] using [resident] field cache with max_size [-1], expire [null]
20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.cache  - [Ani-Mator] [docs] Using stats.refresh_interval [1s]
20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-pool-11-thread-1] INFO  org.elasticsearch.cluster.metadata  - [Ani-Mator] [docs] creating index, cause [api], shards [5]/[1], mappings [doc]

I'm guessing that using dynamic[true] means that I'm not doing it
right.
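A detail visible in the posts above: the mapping file is keyed by "docs", while the server-side mapping and the "creating index" log line list the type as "doc". Nothing in the thread confirms that this mismatch is why the "store" setting is ignored, but it is cheap to rule out. The sketch below is a hypothetical pre-flight check (the class and method names are invented, and the regular expression only handles a simple, well-formed mapping whose first key is the type name) that compares the root key of a mapping document with the type name used for indexing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sanity check, not part of Elasticsearch or the original code.
class MappingTypeCheck {

    // Matches the first quoted key after the opening brace of the document.
    private static final Pattern ROOT_KEY =
            Pattern.compile("^\\s*\\{\\s*\"([^\"]+)\"\\s*:");

    // Returns the first top-level key of the mapping JSON.
    static String rootKey(String mappingJson) {
        Matcher m = ROOT_KEY.matcher(mappingJson);
        if (!m.find()) {
            throw new IllegalArgumentException("no root key found in mapping");
        }
        return m.group(1);
    }

    // True when the mapping is keyed by the same type name used at index time.
    static boolean matchesType(String mappingJson, String typeName) {
        return rootKey(mappingJson).equals(typeName);
    }

    public static void main(String[] args) {
        // Shortened version of the mapping from the first post.
        String mapping = "{ \"docs\": { \"properties\": {"
                + " \"contents\": { \"type\": \"attachment\" } } } }";

        System.out.println("root key: " + rootKey(mapping));
        System.out.println("matches type 'doc': " + matchesType(mapping, "doc"));
    }
}
```

If the check reports a mismatch, the first thing to try would be making the mapping's root key and the indexing type name identical, then looking at the server-side mapping again.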


Re: Attachment Plugin Questions on Storing

Paul Loy
If you can gist full logs, mappings, settings, code, etc, (or as much as you can without giving away proprietary stuff) that's quite useful ;)

So at the end you have a create index[api] for docs. Do you push the mappings in there? Can I see that code?


Re: Attachment Plugin Questions on Storing

Mike Gaffney
https://gist.github.com/1251943


Re: Attachment Plugin Questions on Storing

Paul Loy
I would have expected the following line to produce a "put mapping, cause [api]" message (or something similar) in the logs:

InputStream docsMappings = IndexerMain.class.getResourceAsStream("/mappings/docs.json");


On Thu, Sep 29, 2011 at 2:03 PM, Mike Gaffney <[hidden email]> wrote:
https://gist.github.com/1251943

On Sep 29, 1:08 pm, Paul Loy <[hidden email]> wrote:
> If you can gist full logs, mappings, settings, code, etc, (or as much as you
> can without giving away proprietary stuff) that's quite useful ;)
>
> So at the end you have a create index[api] for docs. Do you push the
> mappings in there? Can I see that code?
>
>
>
>
>
>
>
>
>
> On Thu, Sep 29, 2011 at 1:04 PM, Mike Gaffney <[hidden email]> wrote:
> > On debug all I see related to mapping is:
>
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] DEBUG org.elasticsearch.index.mapper  - [Ani-Mator]
> > [docs] using dynamic[true], default mapping: location[null] and
> > source[{
> >    "_default_" : {
> >    }
> > }]
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] DEBUG
> > org.elasticsearch.index.cache.field.data.resident  - [Ani-Mator]
> > [docs] using [resident] field cache with max_size [-1], expire [null]
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] DEBUG org.elasticsearch.index.cache  - [Ani-Mator]
> > [docs] Using stats.refresh_interval [1s]
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] INFO  org.elasticsearch.cluster.metadata  - [Ani-
> > Mator] [docs] creating index, cause [api], shards [5]/[1], mappings
> > [doc]
>
> > I'm guessing that using dynamic[true] means that I'm not doing it
> > right.
>
> > On Sep 29, 11:52 am, Paul Loy <[hidden email]> wrote:
> > > can you have a look in your logs for statements about mapping changes. It
> > > may be that you don't have everything specified in your mapping so it's
> > > getting overridden by dynamic mappings.
>
> > > You should see update mapping messages in your logs.
>
> > > On Thu, Sep 29, 2011 at 11:44 AM, Mike Gaffney <[hidden email]>
> > wrote:
> > > > I'm trying to make use of the attachments plugin. I've got the
> > > > following Mapping:
>
> > > > {
> > > >        "docs":{
> > > >                "properties" : {
> > > >                        "contents" : {
> > > >                                "type" : "attachment",
> > > >                                "fields" : {
> > > >                                        "contents" : {"store" : "no"}
> > > >                                }
> > > >                        },
> > > >                        "lastModified": { "type" : "long", "index" :
> > > > "analyzed", "store" : "no"}
> > > >                }
> > > >        }
> > > > }
>
> > > > And the following index code:
>
> > > >                                                XContentBuilder
> > > > objectBuilder = jsonBuilder().startObject();
>
> >  objectBuilder.startObject(
> > > > **Index.CONTENTS);
> > > >                                                if
> > > > (extension.equals("xml")){
>
> > > >  objectBuilder.field("_content_**type", MimeTypes.XML);
> > > >                                                }
> > > >                                                else {
>
> > > >  objectBuilder.field("_content_**type", MimeTypes.PLAIN_TEXT);
> > > >                                                }
>
> >  objectBuilder.field("_name",
> > > > file.getName());
>
> > > >  objectBuilder.field("content",
> > > > Base64.encodeBase64(FileUtils.**readFileToString(file).**getBytes()));
>
> >  objectBuilder.endObject();
>
> >  objectBuilder.field(Index.*
> > > > *LAST_MODIFIED, file.lastModified());
>
> >  objectBuilder.endObject();
> > > >                                                IndexRequestBuilder
> > > > setSource = client.prepareIndex(Index.**INDEX,
> > > > Index.TYPE, file.getAbsolutePath()).**setSource(objectBuilder);
>
> > > >  setSource.execute().actionGet(**);
>
> > > > But when I look at the indexing on the server I see:
> > > > {
> > > >    doc: {
> > > >        properties: {
> > > >            lastModified: {
> > > >                index: "analyzed"
> > > >                type: "long"
> > > >            }
> > > >            contents: {
> > > >                path: "full"
> > > >                type: "attachment"
> > > >                fields: {
> > > >                    author: {
> > > >                        type: "string"
> > > >                    }
> > > >                    title: {
> > > >                        type: "string"
> > > >                    }
> > > >                    keywords: {
> > > >                        type: "string"
> > > >                    }
> > > >                    contents: {
> > > >                        type: "string"
> > > >                    }
> > > >                    date: {
> > > >                        format: "dateOptionalTime"
> > > >                        type: "date"
> > > >                    }
> > > >                    content_type: {
> > > >                        type: "string"
> > > >                    }
> > > >                }
> > > >            }
> > > >        }
> > > >    }
> > > > }
>
> > > > Basically, I don't really want to store the contents, just index the
> > > > documents and be able to search on them. I'm indexing files that are
> > > > on the computer already so I don't need the contents, and in fact it's
> > > > taking up a ton of space to have the contents in there.
>
> > > > Another question is, the contents seem to just be the base64. Is that
> > > > correct or am I doing something incorrectly.
>
> > > > I'm using this as a local machine file search mechanism for a large
> > > > art / document tree that each user has locally on their machines.
>
> > > > My results look like this (sorry for the redactions, it's proprietary
> > > > info):
>
> > > > {"took":4,"timed_out":false,
> > > >  "_shards":{"total":5,"successful":5,"failed":0},
> > > >  "hits":{"total":1663,"max_score":1.0,"hits":[
> > > >   {"_index":"docs","_type":"doc","_id":"the_art_redacted","_score":1.0,
> > > >    "fields":{"contents":{"content":"REALLY_LONG_BASE64_STRING",
> > > >      "_name":"the_file_name_redacted","_content_type":"text/plain"}}}]}}
>
> > > > Any additional explanation of attachments will be quite helpful.
>
> > > > Thanks,
> > > >   Mike
>
> > > --
> > > ---------------------------------------------
> > > Paul Loy
> > > [hidden email]://uk.linkedin.com/in/paulloy
>
> --
> ---------------------------------------------
> Paul Loy
> [hidden email]://uk.linkedin.com/in/paulloy



--
---------------------------------------------
Paul Loy
[hidden email]
http://uk.linkedin.com/in/paulloy

Re: Attachment Plugin Questions on Storing

Paul Loy
or rather the block:

InputStream docsMappings = IndexerMain.class.getResourceAsStream("/mappings/docs.json");
String docsMappingAsString = IOUtils.toString(docsMappings);
prepareCreate.addMapping(Index.TYPE, docsMappingAsString);
prepareCreate.execute().actionGet();


On Thu, Sep 29, 2011 at 2:08 PM, Paul Loy <[hidden email]> wrote:
I would have expected the following line to produce a "put mapping, cause [api]" message (or something similar) in the logs:

InputStream docsMappings = IndexerMain.class.getResourceAsStream("/mappings/docs.json");


On Thu, Sep 29, 2011 at 2:03 PM, Mike Gaffney <[hidden email]> wrote:
https://gist.github.com/1251943

On Sep 29, 1:08 pm, Paul Loy <[hidden email]> wrote:
> If you can gist full logs, mappings, settings, code, etc, (or as much as you
> can without giving away proprietary stuff) that's quite useful ;)
>
> So at the end you have a create index[api] for docs. Do you push the
> mappings in there? Can I see that code?
>
> On Thu, Sep 29, 2011 at 1:04 PM, Mike Gaffney <[hidden email]> wrote:
> > On debug all I see related to mapping is:
>
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] DEBUG org.elasticsearch.index.mapper  - [Ani-Mator]
> > [docs] using dynamic[true], default mapping: location[null] and
> > source[{
> >    "_default_" : {
> >    }
> > }]
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] DEBUG
> > org.elasticsearch.index.cache.field.data.resident  - [Ani-Mator]
> > [docs] using [resident] field cache with max_size [-1], expire [null]
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] DEBUG org.elasticsearch.index.cache  - [Ani-Mator]
> > [docs] Using stats.refresh_interval [1s]
> > 20110929-12:59:45 [elasticsearch[Ani-Mator]clusterService#updateTask-
> > pool-11-thread-1] INFO  org.elasticsearch.cluster.metadata  - [Ani-
> > Mator] [docs] creating index, cause [api], shards [5]/[1], mappings
> > [doc]
>
> > I'm guessing that using dynamic[true] means that I'm not doing it
> > right.
>
> > On Sep 29, 11:52 am, Paul Loy <[hidden email]> wrote:
> > > can you have a look in your logs for statements about mapping changes. It
> > > may be that you don't have everything specified in your mapping so it's
> > > getting overridden by dynamic mappings.
>
> > > You should see update mapping messages in your logs.
>



--
---------------------------------------------
Paul Loy
[hidden email]
http://uk.linkedin.com/in/paulloy

Re: Attachment Plugin Questions on Storing

Mike Gaffney
added the log output. There is a create by api that happens. But not
much else that I can tell.

Re: Attachment Plugin Questions on Storing

Paul Loy
20110929-14:22:17 [elasticsearch[Xi'an Chi Xan]clusterService#updateTask-pool-11-thread-1] DEBUG org.elasticsearch.index.mapper - [Xi'an Chi Xan] [docs] using dynamic[true], default mapping: location[null] and source[{
    "_default_" : {
    }
}]
So yeah, that doesn't look good. Can you try putting the mapping after creating the index?
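
For reference, a minimal sketch of putting the mapping as a separate step once the index exists. It goes through the REST put-mapping endpoint (PUT /{index}/{type}/_mapping) over plain HTTP rather than the Java client, purely so the example has no dependencies. The class name, the localhost:9200 address and the cut-down mapping body are assumptions for illustration, and the top-level key of the body is assumed to be the type name "doc" that shows up in the logs.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Elasticsearch or of the code in the gist.
final class PutMappingSketch {

    // Builds the put-mapping URL for an index and type, e.g. .../docs/doc/_mapping.
    static String mappingUrl(String baseUrl, String index, String type) {
        return baseUrl + "/" + index + "/" + type + "/_mapping";
    }

    // Sends the mapping JSON with an HTTP PUT and returns the HTTP status code.
    // Call this only after the create-index request has completed.
    static int putMapping(String baseUrl, String index, String type, String mappingJson)
            throws IOException {
        URL url = new URL(mappingUrl(baseUrl, index, type));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(mappingJson.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Only prints the request line; call putMapping(...) against a running node.
        System.out.println("PUT " + mappingUrl("http://localhost:9200", "docs", "doc"));
    }
}
```

If the put succeeds, the node log should show a put-mapping entry for the type, which is the message to look for.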

On Thu, Sep 29, 2011 at 2:25 PM, Mike Gaffney <[hidden email]> wrote:
added the log output. There is a create by api that happens. But not
much else that I can tell.



--
---------------------------------------------
Paul Loy
[hidden email]
http://uk.linkedin.com/in/paulloy

Re: Attachment Plugin Questions on Storing

Mike Gaffney
Done. I get this log output:

20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] TRACE
org.elasticsearch.index.shard.service  - [White Tiger] [docs][4]
refresh with waitForOperations[false]
20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] DEBUG
org.elasticsearch.index.gateway  - [White Tiger] [docs][4] recovery
completed from local, took [6ms]
    index    : files           [0] with total_size [0b], took[1ms]
             : recovered_files [0] with total_size [0b]
             : reusing_files   [0] with total_size [0b]
    translog : number_of_operations [0], took [6ms]
20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] DEBUG
org.elasticsearch.cluster.action.shard  - [White Tiger] sending shard
started for [docs][4], node[LK1HamGVSqCXrOxh7zr8yA], [P],
s[INITIALIZING], reason [after recovery from gateway]
20110929-15:02:25 [elasticsearch[cached]-pool-1-thread-2] DEBUG
org.elasticsearch.cluster.action.shard  - [White Tiger] received shard
started for [docs][4], node[LK1HamGVSqCXrOxh7zr8yA], [P],
s[INITIALIZING], reason [after recovery from gateway]
20110929-15:02:25 [elasticsearch[White Tiger]clusterService#updateTask-
pool-11-thread-1] DEBUG org.elasticsearch.cluster.metadata  - [White
Tiger] [docs] create_mapping [doc] with source [{"doc":{"properties":
{"contents":{"type":"attachment","path":"full","fields":{"contents":
{"type":"string"},"author":{"type":"string"},"title":
{"type":"string"},"date":
{"type":"date","format":"dateOptionalTime"},"keywords":
{"type":"string"},"content_type":{"type":"string"}}},"lastModified":
{"type":"long","index":"analyzed"}}}}]
... added
20110929-15:02:25 [elasticsearch[White Tiger]clusterService#updateTask-
pool-11-thread-1] TRACE org.elasticsearch.cluster.service  - [White
Tiger] cluster state updated:
version [5], source [put-mapping [doc]]
nodes:

with this config:

{
    "docs": {
        "properties": {
            "contents": {
                "type": "attachment",
                "path": "full",
                "store": "no",
                "fields": {
                    "contents": {"type": "string", "store": "no", "index": "analyzed"},
                    "author": {"type": "string"},
                    "title": {"type": "string"},
                    "date": {"type": "date", "store": "no", "format": "dateOptionalTime"},
                    "keywords": {"type": "string"},
                    "content_type": {"type": "string"}
                }
            },
            "lastModified": {"type": "long", "index": "analyzed", "store": "no"}
        }
    }
}

Re: Attachment Plugin Questions on Storing

Mike Gaffney
I never got the attachment plugin to stop storing the full base64
document in the index, or to stop returning it in results. Returning
it in the results isn't that big of a deal, but I can't really store
every document twice (once on disk and once in the index). We're
already hard-drive constrained as it is.

Shay, do you have any thoughts?


Re: Attachment Plugin Questions on Storing

Lukáš Vlček
Hi,

Maybe there are other possibilities, but you can completely disable storing the _source [1], and you can also return only selected fields [2] in the search results.

[1] http://www.elasticsearch.org/guide/reference/mapping/source-field.html
[2] http://www.elasticsearch.org/guide/reference/api/search/fields.html

However, your request to disable storing only the attachment's base64 data might be reasonable. You are probably not the only user requesting this. On the other hand, this can make things more complicated later, because the complete document source may not be available for re-indexing. It is probably up to Shay whether he wants to allow this or not; you can always clone the mapper-attachments plugin, make your customizations, and try to send a pull request.
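
As a rough illustration of the "selected fields" option, here is a minimal sketch that builds a search request body listing the fields to return. The class name, the example query text and the hand-rolled JSON quoting are assumptions for illustration only; a real client would normally use a JSON library or the Java API's request builders.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds a search body with a "fields" list.
final class SelectedFieldsQuery {

    // Wraps a value in double quotes, escaping backslashes and quotes.
    // Control characters are not handled; this is only a sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds {"fields":[...],"query":{"query_string":{"query":"..."}}}.
    static String build(String queryText, List<String> fields) {
        StringBuilder sb = new StringBuilder("{\"fields\":[");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(fields.get(i)));
        }
        sb.append("],\"query\":{\"query_string\":{\"query\":")
          .append(quote(queryText))
          .append("}}}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Ask only for lastModified so the base64 content is not sent back.
        System.out.println(build("contents:dragon", Arrays.asList("lastModified")));
    }
}
```

This only trims what comes back in the hits; it does not change what is kept on disk.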

just my 2 cents.

Regards,
Lukas



Re: Attachment Plugin Questions on Storing

kimchy
Administrator
Currently, you can disable _source, and then the attachment will not be stored either. There is no option to "remove" the attachment from _source (the JSON doc) and store everything *but* the attachment in _source.
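
A minimal sketch of what a mapping with _source disabled could look like, assuming the type name "doc" from the logs earlier in the thread. The class name is hypothetical and the field list is cut down from the config posted above. The trade-off is the one Lukáš mentions: without _source, the original JSON is no longer available for re-indexing or for returning in hits.

```java
// Hypothetical helper that prints a mapping with _source disabled.
final class SourceDisabledMapping {

    // Returns the mapping JSON, keyed by the given type name.
    static String build(String type) {
        return "{\"" + type + "\":{"
                + "\"_source\":{\"enabled\":false},"
                + "\"properties\":{"
                + "\"contents\":{\"type\":\"attachment\"},"
                + "\"lastModified\":{\"type\":\"long\"}"
                + "}}}";
    }

    public static void main(String[] args) {
        System.out.println(build("doc"));
    }
}
```

The mapping has to be in place before documents are indexed, so this would go in with the create-index or put-mapping call on a fresh index.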




Re: Attachment Plugin Questions on Storing

Mike Gaffney
Thanks for the clarification, guys! Good enough for what I'm doing.

On Oct 12, 2:32 pm, Shay Banon <[hidden email]> wrote:

> Currently, you can disable _source and thus the attachment will not be
> stored as well. There is no option to "remove" the attachment from _source
> (the json doc) and store in the _source everything *but* the attachment.
>
> On Wed, Oct 12, 2011 at 9:22 PM, Lukáš Vlček <[hidden email]> wrote:
> > Hi,
>
> > Maybe there are other possibilities, but you can completely disable
> > storing the _source [1], and you can also return only selected fields [2]
> > in the search results.
>
> > [1] http://www.elasticsearch.org/guide/reference/mapping/source-field.html
> > [2] http://www.elasticsearch.org/guide/reference/api/search/fields.html
>
> > However, your request to disable storing only the attachment's base64 data
> > might be reasonable. You are probably not the only user requesting this. On
> > the other hand, this can make things more complicated later, because the
> > complete document source may not be available for re-indexing. It is
> > probably up to Shay whether he wants to allow this or not; you can always
> > clone the mapper-attachments plugin, make your customizations, and try to
> > send a pull request.
>
> > just my 2 cents.
>
> > Regards,
> > Lukas