How to resolve NumberFormatException issues caused by an empty string

How to resolve NumberFormatException issues caused by an empty string

pulkitsinghal
BTW, please forgive me in advance for even mentioning the word Solr in
this forum because I know ES folks cringe at comparisons between the
two technologies. I understand they are different and I am simply
making an analogy for the "Data Input & Indexing behavior" angle ...
so bear with me here.

The stacktrace from the ES server's NFE is at the end of this message.

I have faced similar NumberFormatException issues before in Solr as
well. I think these happen simply because the underlying Lucene isn't
ready to accept/ignore an empty string for numbers or date/time data.
So I am assuming that this is no different for ES which is built atop
Lucene as well. (1) Let me know if you agree with me so far.

In Solr, I got around this by having its Data Import Handler run
scripts on the incoming documents to either place a number like -1 as
a placeholder or by removing the field explicitly from the document
construction.

So with ES, I was hoping it would be more straightforward. My feed in
ES is the magical and much revered CouchDB river :) And I try not to
define the mappings myself because ES does such a great job of
figuring them out and it is one of the many many many conveniences of
ES that I want to take advantage of.

I was hoping that ES would acknowledge the fact that letting empty
strings through (for core type fields like number, date and time) has
no merit and would simply ignore the empty values. (2) Is this a "bad"
thing to hope for?

The data that failed looks like:
"shipping" :
[
 {
   "nextDay" : "",
   "vendorDelivery":69.99,
   "ground" : "",
   "secondDay":""
 }
]
So imagine my surprise at how well ES did in guessing that
shipping.nextDay was supposed to be a number, only to then choke on
the junk pumped into it as an empty string.

(2) I'm not bad mouthing ES, I'm asking: Can we expect ES to tackle
this or would we be wrong to place such an expectation on ES?

(3) If the data had a proper null value then ES would have handled it
already, because when there is a (JSON) null value for the field and
null_value has not been set up, ES defaults to not adding the field
at all. That is not the case here, so what would the workaround be,
if any? Sanitize my data? Oh lord the tears are rolling down my
cheeks, please say that's not my only option.
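
For reference, the "sanitize my data" route can be fairly small. The following is only a rough sketch, assuming the documents can be touched as plain dicts somewhere before they reach ES; the function name drop_empty_strings is made up for illustration and is not part of ES or the CouchDB river.

```python
import json


def drop_empty_strings(value):
    """Recursively remove dict entries whose value is an empty string.

    Hypothetical helper for illustration only. Other values (0, None,
    False, non-empty strings) are left untouched, and empty strings that
    sit directly inside a list are not removed.
    """
    if isinstance(value, dict):
        return {
            key: drop_empty_strings(item)
            for key, item in value.items()
            if item != ""
        }
    if isinstance(value, list):
        return [drop_empty_strings(item) for item in value]
    return value


# The failing document from this post.
doc = {
    "shipping": [
        {"nextDay": "", "vendorDelivery": 69.99, "ground": "", "secondDay": ""}
    ]
}

cleaned = drop_empty_strings(doc)
print(json.dumps(cleaned))
```

Dropping the key mirrors what ES does with a JSON null when no null_value is configured: the field is simply not added for that document.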

Please let me know what you think.

=== STACKTRACE ====
org.elasticsearch.index.mapper.MapperParsingException: Failed to parse [shipping.nextDay]
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:312)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:577)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:443)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeObject(ObjectMapper.java:491)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeArray(ObjectMapper.java:557)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:435)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:567)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:491)
        at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:289)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:131)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:464)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:377)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.NumberFormatException: empty String
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:992)
        at java.lang.Double.parseDouble(Double.java:510)
        at org.elasticsearch.common.xcontent.support.AbstractXContentParser.doubleValue(AbstractXContentParser.java:88)
        at org.elasticsearch.index.mapper.core.DoubleFieldMapper.parseCreateField(DoubleFieldMapper.java:227)
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:299)
        ... 14 more


Re: How to resolve NumberFormatException issues caused by an empty string

kimchy
Administrator
This actually has nothing to do with Lucene, but with how elasticsearch derives field types and handles "" text for numeric values.

First, deriving a type for a field. When a field is first introduced, its type is derived from its value. This will not work well if the first document introducing nextDay has an empty string, since the type for the field will then be string, and not a number (long / double).

As for empty text, then yes, it will fail to index the doc if empty text is provided and the field is a numeric type. As you mentioned, a null value for the field is what it handles; it does not treat empty text as a null value.
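
To make the two behaviors concrete, here is a toy model in Python. It is only a sketch of the rules described above and is not elasticsearch code: derive_type and ToyIndex are invented names, documents are flat dicts, and Python's float() stands in for the Double.parseDouble call seen in the stack trace.

```python
def derive_type(value):
    """Toy stand-in for deriving a field type from the first value seen."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


class ToyIndex:
    """Remembers the type each field got when it was first introduced."""

    def __init__(self):
        self.field_types = {}

    def index(self, doc):
        for field, value in doc.items():
            if value is None:
                continue  # null value: the field is simply not added
            field_type = self.field_types.setdefault(field, derive_type(value))
            if field_type == "double":
                float(value)  # raises ValueError for "", like the NFE above
            elif field_type == "long":
                int(value)  # same failure mode for an empty string


# Case 1: a number arrives first, so the field is numeric and "" later fails.
numeric_first = ToyIndex()
numeric_first.index({"nextDay": 12.5})
try:
    numeric_first.index({"nextDay": ""})
except ValueError as error:
    print("rejected:", error)

# Case 2: "" arrives first, so the field is derived as a string instead.
empty_first = ToyIndex()
empty_first.index({"nextDay": ""})
print(empty_first.field_types)
```

In the first case the document is rejected; in the second nothing fails, but the field ends up with the wrong type from then on.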

On Tue, Oct 18, 2011 at 9:29 PM, pulkitsinghal <[hidden email]> wrote:


Re: How to resolve NumberFormatException issues caused by an empty string

pulkitsinghal
Hello Shay,

Thanks for the info!

May I ask: "Does it make sense for ES to handle an empty string as it
would handle a null value, once it has already derived that a field is
numeric based on the first value introduced to the system?"

Thoughts?
- Pulkit

On Oct 18, 2:37 pm, Shay Banon <[hidden email]> wrote:


Re: How to resolve NumberFormatException issues caused by an empty string

kimchy
Administrator
It can make sense, maybe with a special flag in the mapping. But note, you still have a problem if your first doc has an empty string and you did not specify the mapping up front, since the field will then be detected as a string type, and not a numeric type.
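
Specifying the mapping up front would look roughly like the body built below. This is a sketch under assumptions: the type name "product" is made up, the field names are taken from the sample document in this thread, and the layout assumes the per-type "properties" mapping format. It only pins the field types so that a leading empty string cannot turn them into strings; it does not make "" acceptable for a numeric field.

```python
import json


def shipping_mapping(type_name):
    """Build a mapping body that declares the shipping fields as doubles.

    type_name is a hypothetical document type; adjust it and the field
    list to the real data.
    """
    numeric_fields = ["nextDay", "vendorDelivery", "ground", "secondDay"]
    return {
        type_name: {
            "properties": {
                "shipping": {
                    "properties": {
                        field: {"type": "double"} for field in numeric_fields
                    }
                }
            }
        }
    }


mapping = shipping_mapping("product")
print(json.dumps(mapping, indent=2))
```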

On Tue, Oct 18, 2011 at 11:42 PM, pulkitsinghal <[hidden email]> wrote: