ElasticSearch built-in Jackson stream parser is fastest way to extract fields


InquiringMind
Just an FYI... Start with, for example, the following JSON document (all on one line for the _bulk API, but pretty-printed below). This follows my basic document structure: a set of field names, with each field taking either a single value or an array of heterogeneous values. Nothing more complex than a Map<String,Object> can represent, in which Object is either a single type (String, Boolean, and so on) or an Array<Object>. A subset of the "throw any JSON document into ES" model, but still a very useful subset that far exceeds any database engine I've ever used:

{
  "_index" : "twitter" ,
  "_type" : "tweet" ,
  "_id" : "3" ,
  "_score" : 1.0 ,
  "_source" : {
     "user" : "bbthing68" ,
     "postDate" : "2012-11-15T14:12:12" ,
     "altitude" : 45767 ,
     "dst" : true ,
     "prefix" : null ,
     "counts" : [ 1 , 2 , 3.14149 , "11.1" , "13" ] ,
     "vdst" : [ true , false , true ] ,
     "message" : [ 2 , "Just trying this out" , "With one/two multivalued fields" ]
   }
}

Both the SearchHit.getSourceAsString and the GetResponse.getSourceAsString methods return the following JSON string (again, it's on one line, but it's pretty-printed here only for this post):

{
  "user" : "bbthing68" ,
  "postDate" : "2012-11-15T14:12:12" ,
  "altitude" : 45767 ,
  "dst" : true ,
  "prefix" : null ,
  "counts" : [ 1 , 2 , 3.14149 , "11.1" , "13" ] ,
  "vdst" : [ true , false , true ] ,
  "message" : [ 2 , "Just trying this out" , "With one/two multivalued fields" ]
}

I was using the getSourceAsMap methods, which return a Map<String,Object>. But when I use the JsonParser in stream parsing mode (as supplied directly by ElasticSearch; no need to fetch the full Jackson jar file), I can directly stream-parse that source so very much faster. My overall response times are now much lower. And it's also much easier and faster for me to just parse the source and pull out only the subset of fields I want, instead of trying to tell ES which subset of fields I want.

Oh, and when I store the fields from my stream parsing process, I put them into a LinkedHashMap<String,Object>. That little bit of overhead keeps the keys (field names) in the exact same order as they appear in the source. Which is really awesomely cool. No more jumbled order of field names when displaying results during testing!
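For illustration, here is a minimal sketch of the approach (not my actual code; it's written against the standalone Jackson 2.x API, and the class and method names are invented). The LinkedHashMap is what keeps the field order; everything else is a straight nextValue() loop:

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public final class SourceParser {

    private static final JsonFactory FACTORY = new JsonFactory();

    // Parses a _source string into a LinkedHashMap, preserving field order.
    public static Map<String, Object> parseSource(String source) throws IOException {
        Map<String, Object> fields = new LinkedHashMap<String, Object>();
        JsonParser p = FACTORY.createParser(source);
        try {
            if (p.nextToken() != JsonToken.START_OBJECT) {
                throw new IOException("expected a JSON object");
            }
            // nextValue() skips FIELD_NAME tokens entirely: after each call the
            // parser sits on a value, and getCurrentName() holds its field name.
            JsonToken t;
            while ((t = p.nextValue()) != JsonToken.END_OBJECT) {
                String name = p.getCurrentName();
                if (t == JsonToken.START_ARRAY) {
                    List<Object> values = new ArrayList<Object>();
                    while ((t = p.nextValue()) != JsonToken.END_ARRAY) {
                        values.add(scalar(p, t));
                    }
                    fields.put(name, values);
                } else {
                    fields.put(name, scalar(p, t));
                }
            }
        } finally {
            p.close();
        }
        return fields;
    }

    // Maps a scalar token to a plain Java value.
    private static Object scalar(JsonParser p, JsonToken t) throws IOException {
        switch (t) {
            case VALUE_STRING:       return p.getText();
            case VALUE_NUMBER_INT:   return p.getLongValue();
            case VALUE_NUMBER_FLOAT: return p.getDoubleValue();
            case VALUE_TRUE:         return Boolean.TRUE;
            case VALUE_FALSE:        return Boolean.FALSE;
            case VALUE_NULL:         return null;
            default: throw new IOException("unexpected token: " + t);
        }
    }
}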


Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

joergprante@gmail.com
To shed some light on this: the code behind getSourceAsMap() does some format detection and decompression in XContentHelper. Did you disable source compression? It is enabled by default.

Jörg

On 12.03.13 16:26, InquiringMind wrote:
> I was using the getSourceAsMap methods, which return a
> Map<String,Object>. But when I use the JsonParser in stream parsing
> mode (as supplied directly by ElasticSearch; no need to fetch the full
> Jackson jar file), I can directly stream parse that source so very
> much faster.


Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

InquiringMind
I completely ignored any settings related to source compression, letting it default to whatever value it has in 19.4, 19.10, and 20.4 (the ES versions I've used).

I originally prototyped my stream parsing of getSourceAsString because I wanted the field order preserved using a LinkedHashMap.

To my surprise, my stream parser is much faster.

Of course, I use the JsonParser.nextValue method, not its nextToken method, for my stream parsing implementation. It seems to greatly simplify the parsing code. Jackson is rather fast. Because ES makes only the stream parser available, it has rather forced me to stream-parse instead of relying on Jackson's Tree Model or ObjectMapper. I don't know whether those have any performance overhead relative to what I have to do when stream parsing, but I've adapted and it works very well.
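For comparison, here is a rough sketch of what the same walk looks like with nextToken, where the caller has to pair each FIELD_NAME token with the value that follows (flat objects only, for brevity):

import java.io.IOException;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

final class TokenWalk {
    // nextToken() delivers field names as separate FIELD_NAME tokens, so the
    // caller must remember the name until the matching value arrives.
    // With nextValue(), this bookkeeping branch disappears entirely.
    static void walk(JsonParser p) throws IOException {
        String name = null;
        JsonToken t;
        while ((t = p.nextToken()) != JsonToken.END_OBJECT) {
            if (t == JsonToken.FIELD_NAME) {
                name = p.getCurrentName();
            } else {
                System.out.println(name + " = " + p.getText());
            }
        }
    }
}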



Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

joergprante@gmail.com
There is almost nothing faster in Java JSON parsing than Jackson in streaming mode, since it uses a highly optimized parser. Note that if you use plain JSON (and not SMILE or compressed JSON), you can add the Jackson libs to your project and also use the TreeModel or ObjectMapper.
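For example, a quick sketch with jackson-databind (the Tweet class here is just a hypothetical binding target for the example document's scalar fields):

import java.io.IOException;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

final class SourceBinding {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Tree model: navigate the source without declaring any classes.
    static String userOf(String source) throws IOException {
        JsonNode root = MAPPER.readTree(source);
        return root.path("user").asText();
    }

    // Object mapping: bind the whole source into a POJO in one call.
    static Tweet bind(String source) throws IOException {
        return MAPPER.readValue(source, Tweet.class);
    }

    // Hypothetical binding class; unknown fields (the arrays) are ignored.
    @JsonIgnoreProperties(ignoreUnknown = true)
    static final class Tweet {
        public String user;
        public String postDate;
        public long altitude;
        public boolean dst;
    }
}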

Jörg


Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

Klaus Brunner
In reply to this post by InquiringMind

Sounds interesting, but I'm not quite sure what you're doing (and why it's faster). Do you get a BytesReference via SearchHit.sourceRef() first, and then let a JsonParser operate on the return value of its .streamInput()?

I currently use SearchHit.source() - which returns a byte[] via BytesReference.bytes() - and then parse that with Jackson, but if there's a way to save an array copy in the whole process, it would be nice. Needs to be compression-safe though.
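Roughly what I have in mind, as a sketch (it assumes the stored bytes are plain, uncompressed JSON):

import java.io.IOException;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;

import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.search.SearchHit;

final class CopyFreeParse {
    private static final JsonFactory FACTORY = new JsonFactory();

    // Parse straight off the hit's backing buffer instead of materializing
    // a byte[] first. Assumes the bytes are plain, uncompressed JSON.
    static JsonParser parserFor(SearchHit hit) throws IOException {
        BytesReference ref = hit.sourceRef();
        return FACTORY.createParser(ref.streamInput());
    }
}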

Klaus


Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

InquiringMind

> Sounds interesting, but I'm not quite sure what you're doing (and why it's faster).

It's faster (almost 3 times as fast) than using the getSourceAsMap method.

I had tried to limit and extract by individual field back when I started with 19.4, but just couldn't get it to work. I've since mastered enough of the index settings and mappings, but have stayed with extracting the _source.
 
> Do you get a BytesReference via SearchHit.sourceRef() first, and then let a JsonParser operate on the return value of its .streamInput()?

No, I get a String via SearchHit.getSourceAsString() first, and then let a JsonParser operate on it. I wasn't sure what the BytesReference was: Compressed? Not compressed?  UTF-8? It didn't seem deterministic enough for me based on the documentation. And when parsing the source as either a String (or as the originally used Map<String,Object> returned via getSourceAsMap), I never had any problems. Even all of my Chinese characters came out perfectly.

> I currently use SearchHit.source() - which returns a byte[] via BytesReference.bytes() - and then parse that with Jackson, but if there's a way to save an array copy in the whole process, it would be nice. Needs to be compression-safe though.

I agree. I just wasn't sure about what I might need to do with compression. I assume that getSourceAsString already knows what to do based on how the _source was stored. At least, that's been my experience so far. And when migrating from 19.4 to 19.10 and now to 20.4, I've seen no issues at all with 20.4 using databases built with ES 19.4 and updated with ES versions 19.4, 19.10, and 20.4. Migration has been painless and smooth.


Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

InquiringMind
One more thing I forgot to mention: When I parse the JSON from getSourceAsString myself, I store the fields I want into a LinkedHashMap instead of a simple HashMap.

That tiny amount of overhead ensures that I can iterate across the map and show the fields in the exact same order that they are stored in the _source. With the map returned by the getSourceAsMap method, the fields come back jumbled, which makes it more difficult to verify results by eye.



Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

Swati Jain
In reply to this post by InquiringMind
Hi Brian,

I am new to Elasticsearch and currently working with version 1.5.1. I am trying to parse the JSON string returned by the getSourceAsString() method into an object. The JsonParser class in Elasticsearch (import org.elasticsearch.common.jackson.core.JsonParser) doesn't allow me to create an instance of it to use its readValueAs(Class) method.

I would like to use the Elasticsearch APIs as much as I can instead of including the Jackson jar separately in my code. Can you please show me the JsonParser code you used to convert a string into an object?

Thanks,
Swati



Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

InquiringMind
Swati,

Well, I tend not to use the built-in Jackson parser anymore. The only advantage I've seen to stream parsing is that I can dynamically adapt to different objects in my own code. But I can't release the code since it's owned by my employer. And for most tasks these days, I use the Jackson jar files and the data binding model. By the way, here are the only additional JAR files that I use in my Elasticsearch-based tools that also include Elasticsearch jars:

For full Jackson support (there are later versions, but these work for now, until the rest of the company moves to Java 8):

jackson-annotations-2.2.3.jar
jackson-core-2.2.3.jar
jackson-databind-2.2.3.jar
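As a small sketch of the data-binding model with those jars (with a nice side effect: Jackson's default Map implementation is LinkedHashMap, so the _source field order survives, just like with the hand-rolled stream parser):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

final class Databind {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Jackson deserializes JSON objects into LinkedHashMap by default, so
    // this keeps the field order from the _source.
    static Map<String, Object> toMap(String source) throws IOException {
        return MAPPER.readValue(
            source, new TypeReference<LinkedHashMap<String, Object>>() {});
    }
}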

This gives me the full Netty server (got tired of looking for it buried inside ES, and found this to be very simple and easy to use). Again, there are later versions but this one works well enough:

netty-3.5.8.Final.jar

And this is the magic that brings Netty to life. My front end simply publishes each incoming Netty MessageEvent to the LMAX Disruptor ring buffer. Then I can predefine a fixed number of background WorkHandler threads to consume the MessageEvent objects, handling each one and responding back to its client. No matter how much load is slammed into the front end, the number of Netty threads stays small since they only publish and they're done. And so, the total thread count stays small even when intense bursts of clients slam the server:

disruptor-3.2.0.jar
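The wiring looks roughly like the sketch below (invented class names only, since the real code belongs to my employer; Netty 3.x and Disruptor 3.2 APIs):

import java.util.concurrent.Executors;

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.WorkHandler;
import com.lmax.disruptor.dsl.Disruptor;

import org.jboss.netty.channel.ChannelHandlerContext;
import org.jboss.netty.channel.MessageEvent;
import org.jboss.netty.channel.SimpleChannelUpstreamHandler;

public class FrontEnd {

    // Each ring-buffer slot just ferries one Netty event to a worker.
    public static class Slot {
        MessageEvent event;
    }

    private static final EventTranslatorOneArg<Slot, MessageEvent> FILL =
            new EventTranslatorOneArg<Slot, MessageEvent>() {
                public void translateTo(Slot slot, long sequence, MessageEvent e) {
                    slot.event = e;
                }
            };

    private final RingBuffer<Slot> ringBuffer;

    @SuppressWarnings("unchecked")
    public FrontEnd(int workers) {
        Disruptor<Slot> disruptor = new Disruptor<Slot>(
                new EventFactory<Slot>() {
                    public Slot newInstance() { return new Slot(); }
                },
                1024,                              // ring size; a power of two
                Executors.newCachedThreadPool());

        // A fixed pool of WorkHandler threads consumes the events; each event
        // goes to exactly one worker, which replies to the client itself.
        WorkHandler<Slot>[] pool = new WorkHandler[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new WorkHandler<Slot>() {
                public void onEvent(Slot slot) throws Exception {
                    handleAndRespond(slot.event);
                }
            };
        }
        disruptor.handleEventsWithWorkerPool(pool);
        ringBuffer = disruptor.start();
    }

    // Netty I/O threads only publish and return, so they never pile up.
    public class Publisher extends SimpleChannelUpstreamHandler {
        @Override
        public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) {
            ringBuffer.publishEvent(FILL, e);
        }
    }

    void handleAndRespond(MessageEvent e) {
        // ... process the request, then write the reply to e.getChannel() ...
    }
}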

I hope this helps. I'd love to publish more details but this is about all I can do for now.

Brian


Re: ElasticSearch built-in Jackson stream parser is fastest way to extract fields

Swati Jain
Thanks, Brian, for your reply. I have started using the jars that you suggested here and they are easy to work with. Thank you.

Swati

