
Opinions on ES with highly related data


Opinions on ES with highly related data

knacktus
Hi guys,

I've just discovered the potential of ES as a scalable multi-purpose cache, or even as the only data store. So far, I've been using an RDBMS with memcached or Redis (for simple queries in the application layer). I've decided to give ES a try by building a prototype, but before I dive in I'd much appreciate your opinions on how I plan to get the data out of ES.

The issue might be that my data is highly related and I need to work mainly with large structures. ES's main task in this regard would be to support a server process that collects all the data items within those large structures. These items are sent to a rich client, where the actual structured views are built.

Here's some example data:

root = {"id": 1, "name": "Plane", "subassemblies": [2, 3, 4]}

body = {"id": 2, "name": "Body", "subassemblies": [5, 6]}

left_wing = {"id": 3, "name": "Wing", "subassemblies": []}

right_wing = {"id": 4, "name": "Wing", "subassemblies": []}

upper_body_structure = {"id": 5, "name": "Upper Body", "subassemblies": []}

lower_body_structure = {"id": 6, "name": "Lower Body", "subassemblies": []}
 

So, I would query ES iteratively to get all items, starting with the root item. Roughly like this, in Python pseudocode:

all_item_ids = []
current_root_id = 1
all_item_ids.append(current_root_id)
current_item_ids = [current_root_id]

while len(current_item_ids) > 0:
    # here would come some more advanced query options
    current_item_ids = query_ES_for_items_by_given_ids_and_return_given_field(current_item_ids, "subassemblies")
    all_item_ids.extend(current_item_ids)

send_ids_to_client(all_item_ids)  # there's a client cache for the item data, so I send the ids only
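For what it's worth, each iteration could be a single terms query per level rather than one request per item. A minimal sketch of the request body one might send, assuming an exact-match (not-analyzed) id field; the function name and the `size` default are mine, not a real mapping:

```python
def build_terms_query(ids, size=10000):
    """Build an ES request body matching documents whose id is in `ids`,
    returning only the "subassemblies" field (field names from the example)."""
    return {
        "query": {"terms": {"id": ids}},  # exact matches only, no scoring needed
        "fields": ["subassemblies"],      # fetch just the child ids, not whole docs
        "size": size,                     # must cover the widest level of the tree
    }

request_body = build_terms_query([2, 3, 4])
```

With 50 levels this would mean roughly 50 round trips, one per level, instead of thousands.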

The amount of data is quite large: up to 100,000 rows with up to 50 levels. So I could end up with queries taking 10,000 arguments (though only exact matches need to be considered). Those could be split into batches, but that's where I hope to get your opinions. (Hitting ES 50 times wouldn't make me nervous, but when it comes to a couple of thousand times, something seems off. Then again, if it took only a couple of seconds overall, I wouldn't complain :-).)
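The batch splitting itself is cheap; a minimal chunking helper (the batch size of 1024 is an arbitrary number to tune, not any ES limit):

```python
def batched(ids, batch_size=1024):
    """Yield successive slices of `ids`, each no longer than batch_size."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

# 10,000 ids at one level become 10 requests instead of one oversized query.
batches = list(batched(list(range(10000)), batch_size=1024))
```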

Is this the right approach to handling large structures? Do you see any general showstoppers or flaws (like limits on query size)?

Another question is about storing Thrift- or Protocol Buffers-encoded data. How would you store those for simple get and mget operations? (These formats are used for transport and in the client cache, which is basically a key-value store.)
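One option I'm considering is base64-encoding the Thrift/protobuf bytes into a stored, non-indexed string field, and decoding after get/mget. A sketch of the round trip (the `payload` field name and the sample bytes are made up):

```python
import base64

def encode_blob(raw_bytes):
    """Base64 so the binary payload survives JSON transport to and from ES."""
    return base64.b64encode(raw_bytes).decode("ascii")

def decode_blob(stored_text):
    """Recover the original bytes from the stored string field."""
    return base64.b64decode(stored_text.encode("ascii"))

doc = {"id": 2, "payload": encode_blob(b"\x08\x02\x12\x04Body")}  # fake encoded bytes
```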

On top of that, I would use full-text search and general combined searches across the whole data set. But I have no doubt that ES is the right choice there. So, if I can retrieve the structured data in a performant way, ES would be an awesome, powerful all-in-one solution.

Cheers and thanks for any comments and opinions,

Jan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Re: Opinions on ES with highly related data

phill
ES is for storing _documents_. It appears what you are trying to do is store a hierarchy the way you would represent it in an RDBMS. Try to build the document that you want to retrieve, because that is what you will get: one or more documents. I doubt you ever want a query to return just
body = {"id": 2, "name": "Body", "subassemblies": [5, 6]}
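In other words, denormalize: index the assembly tree as one document (or a few large ones), so a single get returns everything the client view needs. A sketch of what the plane might look like as one embedded-tree document; the structure is illustrative only, not a recommended mapping:

```python
plane_doc = {
    "id": 1,
    "name": "Plane",
    "subassemblies": [
        {"id": 2, "name": "Body", "subassemblies": [
            {"id": 5, "name": "Upper Body", "subassemblies": []},
            {"id": 6, "name": "Lower Body", "subassemblies": []},
        ]},
        {"id": 3, "name": "Wing", "subassemblies": []},
        {"id": 4, "name": "Wing", "subassemblies": []},
    ],
}

def count_nodes(doc):
    """Walk the embedded tree; one document now holds all six items."""
    return 1 + sum(count_nodes(child) for child in doc["subassemblies"])
```

The trade-off is that updating one subassembly means reindexing the containing document.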

-Paul

On 2/15/2013 6:47 PM, [hidden email] wrote:

