Feedback on data model (over 1 billion documents)

Nitish Sharma
Hi,
We are planning to use ES to search through almost 2 billion documents
(and growing fast). Each document has one or more social interactions
associated with it. A search should be performed on the document data
as well as on the social interactions linked to it. We would like
community feedback on the model we have chosen.

We want to be able to do the following: imagine one document with two
social interactions, one mentioning 'tree' and the other 'house'. A
search on 'tree AND house' should yield this document.

We are unsure how to record social interactions. We came up with
this model, and it works for our search requirement:
1. a unique URL field
2. an array of social interactions
3. a social interaction consists of several text and integer fields

(See this Gist for a more complete JSON representation:
https://gist.github.com/1751349 )
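
For readers without access to the Gist, the model described above might look roughly like the sketch below. The field names (`url`, `interactions`, `user`, `text`, `likes`) are assumptions made for illustration, not the actual mapping. The helper only imitates an AND query over the combined text of all interactions using plain Python; it does not call Elasticsearch.

```python
# Hypothetical shape of one nested document; all field names are assumptions.
doc = {
    "url": "http://example.com/page",
    "interactions": [
        {"user": "alice", "text": "a tree in the garden", "likes": 3},
        {"user": "bob", "text": "the house next door", "likes": 1},
    ],
}


def matches_all(document, terms):
    """Return True if every term occurs in at least one interaction.

    This imitates an AND query over a field holding the text of all
    interactions of one document; it is not an Elasticsearch call.
    """
    words = set()
    for interaction in document["interactions"]:
        words.update(interaction["text"].lower().split())
    return all(term.lower() in words for term in terms)


print(matches_all(doc, ["tree", "house"]))   # True
print(matches_all(doc, ["tree", "garage"]))  # False
```

Because both interactions live in the same document, the two terms can be matched together even though no single interaction contains both.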

The problem is appending social interactions. For every incoming
social interaction we have to do a GET request to check whether the
document already exists. If it does, we append the interaction and
POST. If it doesn't, we create a new record and POST. Is this a
problem in terms of overhead? We think it is.
Another problem is that we want to have multiple processes
updating/inserting documents. If two processes update (or create) the
same document at the same time, this will lead to inconsistencies. We
know of the version functionality of ES; should we try to harness that?
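
The read-modify-write cycle with version checking can be sketched as an optimistic-concurrency retry loop. The sketch below uses a small in-memory class as a stand-in for the index, so `VersionedStore`, `VersionConflict`, and `append_interaction` are hypothetical names, not Elasticsearch client calls. The idea it illustrates is that a write carries the version it was based on and is rejected if the stored version has moved on.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    """In-memory stand-in for a store that keeps a version per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        # Returns (version, body); version 0 means "does not exist yet".
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, body, expected_version):
        current_version, _ = self._docs.get(doc_id, (0, None))
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, body)


def append_interaction(store, url, interaction, max_retries=5):
    """Read-modify-write that retries when another writer got in first."""
    for _ in range(max_retries):
        version, body = store.get(url)
        if body is None:
            new_body = {"url": url, "interactions": [interaction]}
        else:
            new_body = {
                "url": url,
                "interactions": body["interactions"] + [interaction],
            }
        try:
            store.put(url, new_body, expected_version=version)
            return new_body
        except VersionConflict:
            continue  # stale read; fetch the document again and retry
    raise RuntimeError("gave up after %d conflicts" % max_retries)


store = VersionedStore()
append_interaction(store, "http://example.com/page", {"text": "tree"})
append_interaction(store, "http://example.com/page", {"text": "house"})
print(store.get("http://example.com/page"))
```

This keeps concurrent writers consistent, but every append still costs a read plus a write, and conflicts add further round trips, so it does not remove the overhead concern.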

Another problem entirely is the potential size of a document. Imagine
a document having tens of thousands of social interactions. Would the
document grow prohibitively large? We expect to search on users. A
user is recorded in a social interaction, so such a search would yield
the whole (huge) document (and possibly more documents) rather than
only that user's interactions. Can we do something about this? Trim
the document, for example, before returning it?

Perhaps we should choose another data model. Your help is greatly
appreciated.

Cheers
Nitish

Re: Feedback on data model (over 1 billion documents)

Berkay Mollamustafaoglu-2
Flattening the data model would solve your problems assuming you can live with it. 
You can add URL and rank as properties to each interaction and store each interaction as a separate document. You would get multiple docs per URL in the results, but it may be feasible to handle that in your application code. 
With this data model, you'd only do a single write to ES for each interaction, so you'd get much better performance.
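
A minimal sketch of the flattening step, assuming a nested document with hypothetical `url`, `rank`, and `interactions` fields (the real field names are in the Gist and may differ). Each interaction becomes its own document and carries a copy of the parent's URL and rank.

```python
def flatten(document):
    """Turn one nested document into one flat document per interaction."""
    flat_docs = []
    for interaction in document["interactions"]:
        flat = dict(interaction)       # copy the interaction's own fields
        flat["url"] = document["url"]  # repeat the parent URL on each one
        if "rank" in document:
            flat["rank"] = document["rank"]
        flat_docs.append(flat)
    return flat_docs


nested = {
    "url": "http://example.com/page",
    "rank": 7,
    "interactions": [
        {"user": "alice", "text": "a tree in the garden"},
        {"user": "bob", "text": "the house next door"},
    ],
}

for flat_doc in flatten(nested):
    print(flat_doc)
```

With this shape, an incoming interaction is indexed on its own, with no prior GET of an existing document.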

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype


On Mon, Feb 6, 2012 at 5:26 AM, Nitish Sharma <[hidden email]> wrote:
Hi,
We are planning to use ES to search through almost 2 billion documents
(and growing fast). Each document has one or more social interaction
associated with it. A search should be performed on document data as
well as on social interactions linked to it. We would like to have
community feedback on the model we have chosen.

Re: Feedback on data model (over 1 billion documents)

Dan Everton
Your data model sounds a lot like a graph. You may want to look into
a graph database like Neo4j coupled with Lucene directly rather than
Elasticsearch.

Re: Feedback on data model (over 1 billion documents)

kimchy
Administrator
In reply to this post by Berkay Mollamustafaoglu-2
Berkay's suggestion is a good one. What I would like to know more about is what type of searches will be executed, i.e. do you expect to get the URLs back, or specific user interactions?

On Monday, February 6, 2012 at 4:10 PM, Berkay Mollamustafaoglu wrote:

Flattening the data model would solve your problems assuming you can live with it. 
You can add URL and rank as properties to each interaction and store each interaction as a separate document. You would get multiple docs per URL in the results, but it may be feasible to handle that in your application code. 
With this data model, you'd only do a write to ES for each interaction hence you'd get much better performance.

Re: Feedback on data model (over 1 billion documents)

Nitish Sharma
Hi,
Thanks a lot for your replies, folks!
We are already aware that flattening the data model would give us
significantly better indexing performance than the graph-like data
model we currently have.
There are two major problems with a flattened data model:
1. A search would return literally thousands of documents, most of
them pointing to the same URL (since they are social interactions on
the same entity). Consequently, filtering out the unique URLs would be
costly in both time and memory.
2. One specific type of search query (the one we described in the
previous post) cannot be supported with this model. If the social
interactions mentioning "tree" and "house" are separate documents,
then a search on "tree AND house" would not yield either of them. We
expect this search to return the URL, though, since the entity the URL
points to has both the "tree" and the "house" keyword in its
associated interactions. Is it possible to perform this type of query
on a flattened data model using some Elasticsearch construct we are
not aware of?
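
To make the two problems concrete, here is a sketch of what the application would have to do on top of a flattened model: run one lookup per term, reduce the hits to unique URLs, and intersect the results. It uses an in-memory list in place of real search calls, and the names `flat_docs`, `urls_matching`, and `urls_matching_all` are hypothetical. It illustrates the workaround and its cost, not a built-in Elasticsearch feature.

```python
# Flattened interactions; field names are assumptions for the example.
flat_docs = [
    {"url": "http://example.com/a", "text": "a tree in the garden"},
    {"url": "http://example.com/a", "text": "the house next door"},
    {"url": "http://example.com/b", "text": "a tree by the road"},
]


def urls_matching(term, docs):
    """Stand-in for one search per term, reduced to the unique URLs hit."""
    return {d["url"] for d in docs if term in d["text"].lower().split()}


def urls_matching_all(terms, docs):
    """AND across separate interaction documents: intersect per-term URL sets."""
    url_sets = [urls_matching(term, docs) for term in terms]
    return set.intersection(*url_sets) if url_sets else set()


print(urls_matching_all(["tree", "house"], flat_docs))
```

Each per-term set has to be held in memory before intersecting, which is the time and memory cost described in point 1.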

What we expect from a search depends on the type of search. Some
searches need to return only (unique) URLs, while others should
return URLs as well as specific user interactions.

Cheers
Nitish

On Feb 7, 11:17 am, Shay Banon <[hidden email]> wrote:

> Berkay suggestion is a good one, what I would like to know more is what type of searches will be executed? i.e. do you expect to get the URLs back, or specific user interactions?

Re: Feedback on data model (over 1 billion documents)

Ivan Brusic
In reply to this post by Dan Everton
Neo4j actually uses Lucene as its default backend index:
http://docs.neo4j.org/chunked/snapshot/indexing.html

On Mon, Feb 6, 2012 at 2:15 PM, Dan Everton <[hidden email]> wrote:
> Your data model sounds a lot like a graph. You may want to look in to
> a graph database like Neo4J coupled with Lucene directly rather than
> Elasticsearch.

Re: Feedback on data model (over 1 billion documents)

Nitish Sharma
@Shay: Do you also think we should take a hard look at Neo4j? Its
search side is not as powerful, though.
Is there any way to store relationships between documents? I even
tried the new update API to append interactions to the document as
they come in, but that's also really slow. Any suggestions?

Cheers
N.

On Feb 9, 12:29 am, Ivan Brusic <[hidden email]> wrote:

> Neo4j actually uses Lucene as its default backend index: http://docs.neo4j.org/chunked/snapshot/indexing.html

Re: Feedback on data model (over 1 billion documents)

kimchy
Administrator
You can store relationships between documents using the parent/child feature, but you will need to make sure that a parent and its children exist on the same shard (so they can be joined).

I have not used Neo4j, so I can't comment. What I can say is that with highly connected data, you still need to partition the data somehow at some point (or stay with a single server).
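
The same-shard requirement can be illustrated with a toy routing function. This is only a sketch: the hash (`zlib.crc32`), the shard count, and the IDs are assumptions for the example and are not what Elasticsearch uses internally. The point is that children are routed by their parent's ID instead of their own, so they land on the parent's shard.

```python
import zlib

NUM_SHARDS = 5  # assumption for the example


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Pick a shard from a routing key (illustrative hash, not Elasticsearch's)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


parent_id = "http://example.com/page"
child_ids = ["interaction-1", "interaction-2", "interaction-3"]

parent_shard = shard_for(parent_id)
# Children are routed by their parent's ID, not by their own ID.
child_shards = [shard_for(parent_id) for _ in child_ids]

print(parent_shard, child_shards)
```

A consequence is that a parent with tens of thousands of children concentrates all of them on one shard.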

On Monday, February 13, 2012 at 6:40 PM, Nitish Sharma wrote:

@Shay: Do you also think we should give a hard look into Neo4J? Its
search aspect is not as powerful, though.
Is there any possible way to store relationships between various
documents? I even tried the new update API to append interactions in
the document as they come in, but thats also really slow. Any
suggestions?

Cheers
N.

Re: Feedback on data model (over 1 billion documents)

Nitish Sharma
@Shay: Thanks very much. This parent/child feature may just do the
trick for us. We've experimented a bit with it, and it seems to fit
our requirements. There are a few more things we need from it, though:
1. Getting all children of a parent. I suppose there is no official
API call for that. Is it even possible?
2. When doing a parent/child search, is it possible to specify that
the results should contain only parent documents, only child
documents, or both?

Cheers
Nitish
On Feb 14, 3:02 pm, Shay Banon <[hidden email]> wrote:

> You can store relationship between documents using the parent/child feature, but, you will need to make sure that a parent and its children can exist on a single shard (so they can be joined).
>
> I have not used neo4j, so can't comment. What I can say is that with highly connected data, you still need to somehow partition the data at one point (or stay with a single server).

Re: Feedback on data model (over 1 billion documents)

Nitish Sharma
In reply to this post by kimchy
@Shay: I have another issue with parent/child search. I've been trying
to get a filtered search to work with the parent/child structure, but
I get an error: "Parse Failure [No parser for element [filtered]".
Here is the gist: https://gist.github.com/1837082
Can you point out what's wrong with it?

Cheers
Nitish

Re: Feedback on data model (over 1 billion documents)

kimchy
Administrator
Regarding the query, you need to wrap the filtered part in a "query" element as well.

Getting back the children for the parents will require an additional call; the search only returns the parents matching the query.
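
The wrapping fix might look like the sketch below, written as Python dictionaries that serialize to the JSON search body. The Gist's actual query is not reproduced here, so the child type `interaction` and the fields `text` and `domain` are assumptions. Only the outer structure matters: `filtered` is a query type and has to sit inside the top-level `query` element of the search request.

```python
import json

# Hypothetical inner query; the type and field names are assumptions.
filtered_part = {
    "filtered": {
        "query": {
            "has_child": {
                "type": "interaction",
                "query": {"term": {"text": "tree"}},
            }
        },
        "filter": {"term": {"domain": "example.com"}},
    }
}

# Shape that presumably triggers "No parser for element [filtered]":
# "filtered" sits at the top level of the search body.
broken_body = filtered_part

# Shape suggested in the reply: the filtered part wrapped in "query".
fixed_body = {"query": filtered_part}

print(json.dumps(fixed_body, indent=2))
```

Note that the `filtered` query belongs to the Elasticsearch versions of that period; it was deprecated in 2.0 and removed in 5.0.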

On Wednesday, February 15, 2012 at 6:33 PM, Nitish Sharma wrote:

@Shay: I have another issue with parent/child search. I've been trying
to get filtered search work with parent/child structure. But I get an
error: "Parse Failure [No parser for element [filtered]".
Can you point out whats wrong with it?

Cheers
Nitish