
CAP theorem


CAP theorem

talsalmona
Hi,

According to the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem
and http://books.couchdb.org/relax/intro/eventual-consistency),
ElasticSearch can satisfy at most two of the following:
* Consistency
* Availability
* Partition Tolerance

My guess is Availability and Partition Tolerance are the ones
supported by ElasticSearch and consistency is eventual, meaning that
two clients may sometimes get different results when executing the
same query.
Am I right?

Thanks,
Tal

Re: CAP theorem

Sergio Bossa
On Sun, Jun 13, 2010 at 6:51 AM, Tal <[hidden email]> wrote:

> My guess is Availability and Partition Tolerance are the ones
> supported by ElasticSearch and consistency is eventual, meaning that
> two clients may sometimes get different results when executing the
> same query.
> Am I right?

Nope, AFAIK (please Shay correct me if I'm wrong) ElasticSearch provides
per-document consistency, meaning that writes will be atomically
executed on the "document owner" shard and synchronously replicated to
the replica shards.
Regarding A vs P, I honestly don't know how ElasticSearch behaves in
case of network partitions: Shay will surely be able to shed some
light (the discussion is interesting indeed).
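To make the "document owner" idea concrete, here is a minimal sketch (Python, all names made up; the hashing scheme is an assumption for illustration, not ElasticSearch's actual routing code) of routing a document to its owning shard and replicating the write synchronously:

import hashlib

class Shard:
    """Toy in-memory shard: just a dict of doc id -> document."""
    def __init__(self):
        self.docs = {}

    def write(self, doc_id, doc):
        self.docs[doc_id] = doc

def owning_shard(doc_id, num_primary_shards):
    # Route every document to a single "owner" (primary) shard by hashing its id.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_primary_shards

def index_document(doc_id, doc, primaries, replicas):
    # Write on the owner shard first, then replicate synchronously:
    # we only return once every replica of that shard has taken the write.
    n = owning_shard(doc_id, len(primaries))
    primaries[n].write(doc_id, doc)
    for replica in replicas[n]:
        replica.write(doc_id, doc)
    return n

# Example: 2 primary shards, each with one replica.
primaries = [Shard(), Shard()]
replicas = [[Shard()], [Shard()]]
index_document("user-42", {"name": "Tal"}, primaries, replicas)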

Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: CAP theorem

kimchy
Administrator
Hi,

   When it comes to CAP, at a very high level, elasticsearch gives up on partition tolerance. This is for several reasons:

1. I personally believe that *within the same data center*, network partitions very rarely happen, and when they do, it's a small set of machines (many times a single machine) that gets "partitioned out of the network". When a single machine gets disconnected from the network, that's not going to affect elasticsearch. When it comes to crossing data centers, a solution that gives up on consistency can be built on top of elasticsearch, either by elasticsearch itself (in the future), or, today, by using messaging in some form to replicate changes between the two data centers.

2. When it comes to search engines and inverted indices, it's very hard, to the point of impossible, to build a solution that tries to resolve consistency problems on the fly the way most products do (the famous "read repair"). When you search, you search over a large number of docs, and you can't read-repair each one all the time. Key value / column based solutions have it easy here... (a rough sketch of the read repair idea follows below).
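For contrast, a rough, hedged sketch of what read repair looks like in a key/value store (the general idea only, not any specific product's implementation): read every replica, keep the newest version, and push it back to stale replicas. Doing this for each of the thousands of documents a single search query touches is exactly what does not scale for an inverted index.

class Replica:
    """Toy key/value replica holding key -> (version, value)."""
    def __init__(self, data=None):
        self.data = dict(data or {})

    def get(self, key):
        return self.data.get(key)

    def put(self, key, versioned_value):
        self.data[key] = versioned_value

def read_with_repair(key, replicas):
    copies = [(r, r.get(key)) for r in replicas]
    live = [c for _, c in copies if c is not None]
    if not live:
        return None
    version, value = max(live)                 # newest version wins
    for r, c in copies:                        # repair stale or missing copies
        if c is None or c[0] < version:
            r.put(key, (version, value))
    return value

# One replica is stale; the read returns the newest value and repairs the other.
replicas = [Replica({"k": (2, "new")}), Replica({"k": (1, "old")})]
print(read_with_repair("k", replicas))         # -> new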

   It might seem confusing, and it might look like consistency is the one elasticsearch chooses to give up on because of its near real time nature. But it's not. The near real time aspect is mainly due to the overhead of making changes visible for search, but the changes are there once they are performed. Note, there are nice advancements towards becoming fully real time. I talked to Michael Busch at Berlin Buzzwords, and he implemented a very nice real time search solution. It actually works much the same way as something I started working on, but now I can just wait for it :).
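As a hedged illustration of that near real time behaviour (assuming a local elasticsearch node on localhost:9200 and the index / refresh / search endpoints of that era; the index, type and field names are invented for the example):

import json
import requests

ES = "http://localhost:9200"

# Index a document. The write is acknowledged (and replicated) immediately...
requests.put(f"{ES}/tweets/tweet/1",
             data=json.dumps({"user": "tal", "text": "hello"}),
             headers={"Content-Type": "application/json"})

# ...but it only becomes visible to *search* once a refresh has happened,
# which is where the "near" in near real time comes from.
requests.post(f"{ES}/tweets/_refresh")

result = requests.get(f"{ES}/tweets/_search", params={"q": "user:tal"}).json()
print(result["hits"]["total"])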

-shay.banon


Re: CAP theorem

Sergio Bossa
On Mon, Jun 14, 2010 at 2:29 PM, Shay Banon
<[hidden email]> wrote:

> 1. I personally believe that *within the same data center", network
> partitions very rarely happen, and when they do, its a small set (many times
> single) machine that gets "partitioned out of the network". When a single
> machine gets disconnected from the network, then thats not going to affect
> elasticsearch.

More specifically, how does ES cope with partitioned nodes?
Say we have 3 nodes (a, b, c) with 'c' as the master, and a partition
happens as follows: (a, b) (c), how does ES behave? How does it avoid
split brain problems?

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: CAP theorem

kimchy
Administrator
You can't avoid split brain; it will happen once you get network partitioning. The question is how you resolve it, and also what the effect of the split brain is. In your scenario, once the split happens, either (a) or (b) will become master of that sub cluster. If (c) got totally disconnected, then that's great, since clients won't be able to see it either, so no data will be lost. If, on the other hand, clients got partitioned with (c) as well, then they will continue to work with (c), while other clients will work with (a) and (b).

Once the network partition is resolved, then some sort of data resolution needs to occur, either by discarding the small cluster, or by doing version / conflict resolution.

-shay.banon


Re: CAP theorem

Sergio Bossa
On Wed, Jun 16, 2010 at 9:30 PM, Shay Banon
<[hidden email]> wrote:

> You can't avoid split brain, it will happen once you get network
> partitioning,

Not sure about this statement.
While I'm far from being a distributed systems expert (you're probably
way more of one than me), I think the impossibility result pretty much
depends on the system assumptions.
Clearly, it's impossible to avoid split brain in fully asynchronous
networks with lost messages.
But, while you're forced to keep that assumption for partition
tolerant *and* available systems, ES doesn't aim at tolerating
partitions (as you stated in your previous mail), so you can actually
avoid split brain by employing quorum algorithms and tolerating only
a given number of "failures" (because consistency needs to be
preserved), blocking any partitioned nodes until they get
reconnected.
You could also choose to embrace partition tolerance over
availability and adopt the solution above, but avoid blocking
partitioned nodes and just assume a fail-stop for them, or assign a
static "owner" to a given index/shard, so that if the owner fails or
gets partitioned, all writes to its index/shard are prohibited to
preserve consistency.
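A minimal sketch of that quorum approach (all names invented for illustration): a node only accepts writes while it can see a strict majority of the *configured* cluster, so at most one side of a partition can keep accepting writes, and the blocked side simply waits to be reconnected.

def has_quorum(visible, configured_size):
    # Strict majority of the configured cluster, not of whoever happens to be
    # reachable - otherwise both sides of a partition could claim a majority.
    return visible >= configured_size // 2 + 1

class Node:
    def __init__(self, name, configured_size):
        self.name = name
        self.configured_size = configured_size
        self.visible = configured_size

    def on_membership_change(self, reachable):
        self.visible = len(reachable)

    def accept_write(self, doc):
        if not has_quorum(self.visible, self.configured_size):
            raise RuntimeError(f"{self.name}: no quorum "
                               f"({self.visible}/{self.configured_size}), write blocked")
        return f"{self.name}: indexed {doc!r}"

# Three nodes (a, b, c) split into (a, b) | (c): only the majority side writes.
a = Node("a", 3); a.on_membership_change(["a", "b"])
c = Node("c", 3); c.on_membership_change(["c"])
print(a.accept_write({"id": 1}))        # succeeds: sees 2 of 3 nodes
try:
    print(c.accept_write({"id": 1}))
except RuntimeError as blocked:
    print(blocked)                      # c sees 1 of 3, so its writes stay blocked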

> If, on the other hand, clients got
> partitioned with (c) as well, then they will continue to work with (c),
> while other clients will work with (a) and (b).

This kind of solution indeed describes an available and partition
tolerant system, because both ends of your partition stay available
and you're actually sacrificing consistency.

> Once the network partition is resolved, then some sort of data resolution
> needs to occur, either by discarding the small cluster, or by doing version
> / conflict resolution.

Which is very hard for indexed data. Moreover, if this needs to be
done manually, users may actually miss inconsistencies or mess
things up ...

I honestly thought ES provided some kind of algorithm to avoid
inconsistencies, but it doesn't seem to be the case. I'm not saying
it's bad, just that it's different from what I expected.

Thanks for your attention,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: CAP theorem

kimchy
Administrator
Maan, a discussion like this should be had over a beer and not over emails; it's very hard to convey all the different aspects, and trying to write short answers just misses the point, but I will try and write something... .

What I meant when I said you can't get around split brain was actually that you can't avoid network partitions; the question is what you do with them, and whether split brain can be avoided or not. One of the things I don't like (even in respectable projects) is the "assertion" that this product does this or does that, especially since CAP relates to a point in time ....

First thing to note is the fact that a search engine is very different from key value or column based storage systems. A typical search query hits a huge amount of data, and you want to keep scoring correct for it without needing to "repair" each matching doc on every search.

Another point is to differentiate between what elasticsearch does now and what it is designed to do in its final version. It is certainly architected in a way to solve most of these things, and to behave based on what the user decided to give up on, but it's simply not implemented yet... (I will point out what is not implemented later).

So, one of the main problems when network partitioning happens is that you can't know whether the node that got partitioned is down or just not reachable, and the question is how you handle it. It gets even worse if the node that was partitioned got partitioned together with clients connected to it, which keep on working against it. It gets even more interesting in systems that reallocate shards in the event of node failure (or a node getting partitioned out), as elasticsearch does.

The main problem with the above is the fact that clients (that got partitioned with the offending node(s)) might still be sending data to them, in which case you have two options. The first is to *not* allow them to make changes (search might still be ok), and the other is to allow them to write and reconcile once the network partitioning is resolved.

Let's start with not allowing writes. For that, you need to identify when writes will not be allowed. This is much simpler to do when you have fixed location master(s), since if you lose the connection to it (or to a quorum of them), you basically go into this mode. In elasticsearch there is no fixed location master out of the box, so the other option is to do some sort of quorum across all the nodes in the cluster. That's not simple, mainly because an elasticsearch cluster might be reduced to a smaller size intentionally. One simple option to solve this is to have the user define a "minimum" size of cluster that should be up, and if it's not, the nodes will not allow writes. This will be in elasticsearch and it's actually not that difficult to implement. It actually solves a lot of cases. Assuming you managed to identify the situation and block writes, resolving the network partitioning once it heals is simple: you just treat the other nodes as fresh ones.
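A tiny sketch of that user-defined minimum (the setting name and semantics here are assumptions made up for this example, not an actual elasticsearch option at the time): unlike a computed majority, it lets an operator shrink the cluster on purpose without blocking writes.

# Hypothetical node-level setting: the smallest cluster size in which this
# node will still accept writes.
MINIMUM_CLUSTER_SIZE = 2

def writes_allowed(nodes_in_cluster):
    return nodes_in_cluster >= MINIMUM_CLUSTER_SIZE

assert writes_allowed(3)          # full 3-node cluster: writes accepted
assert writes_allowed(2)          # one node removed intentionally: still fine
assert not writes_allowed(1)      # a lone partitioned node: writes blocked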

As a side note, elasticsearch is architected in a way that implementing fixed location masters should be possible, and then you can easily implement a "big table"/"hadoop" like solution. This will also be available as an option in the GA elasticsearch version.

Yet another option is to always allow writes, and reconcile changes when the network partitioning is resolved. This can be solved in elasticsearch's case by adding a version/timestamp to each doc indexed, and the reconciliation will occur when the network partitioning is resolved (rather than doing read repair). Two copies of the same shard, one from each side of the partition, will get reconciled based on the changes done to them (either through versioning / vector clocks / timestamps). The typical problem with this is handling deletes, usually by adding delete markers (tombstones), but there is no foolproof solution for this since you will need to purge those markers at some point (it's explained, in Cassandra's case, here: http://wiki.apache.org/cassandra/DistributedDeletes).
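For illustration, a minimal sketch of that version-based reconciliation between two copies of a shard, tombstones included (last writer wins per document; this shows the general idea only, not anything elasticsearch actually implemented at the time):

TOMBSTONE = "<deleted>"   # delete marker kept alongside a version number

def reconcile(side_a, side_b):
    """Merge two {doc_id: (version, doc_or_TOMBSTONE)} copies of a shard."""
    merged = {}
    for doc_id in side_a.keys() | side_b.keys():
        a = side_a.get(doc_id, (0, TOMBSTONE))
        b = side_b.get(doc_id, (0, TOMBSTONE))
        merged[doc_id] = max(a, b, key=lambda entry: entry[0])   # newest wins
    # Tombstones have to stick around until some later purge, otherwise a
    # delete done on one side would be resurrected by the other side's copy.
    return merged

side_a = {"1": (3, {"title": "updated on side a"}), "2": (1, {"title": "old"})}
side_b = {"1": (2, {"title": "stale copy"}),        "2": (4, TOMBSTONE)}
print(reconcile(side_a, side_b))
# doc "1" keeps side a's newer update, doc "2" stays deleted (version 4 tombstone)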

This is also something that I do want to support in elasticsearch, but further into the future.

Hope the above makes sense and explains things (I skipped some things, otherwise this email would need to be turned into a book ;) ). Let me finish with a personal rant. My problem with the CAP theorem is that it seems so simple, three simple properties of which only two can be realized, that people make the mistake of really over-simplifying distributed systems and missing a lot of the "fine print" in those systems. My other problem is with the Dynamo paper, which, again, really simplifies distributed systems and is, IMO, a patch built on top of another patch. I would say that if Amazon were to write "dynamo paper, the revenge of Mr. Vogels", it would be much, much longer ;).

For example, not many people are aware of Cassandra's delete handling, and the fact that deletes might get lost. You can insert and then delete, and that delete might get lost because of the way deletion is handled. Another problem is that inserts might get lost as well..., with hinted handoff. Or CouchDB, which loses all the "history" of changes once a compaction occurs, so you can't really resolve changes properly once a compaction has run. And the list goes on for other solutions.

Or with Terracotta, where not using sync writes (which sync to disk) means that you might lose data in the event of a failure (as far as I know). Or if you don't have enough memory to keep the Berkeley btree nodes loaded into it. Or when you run a server with a hot backup, and what happens when they get partitioned (what really happens, btw? you are more of a Terracotta expert than me..., even more interesting is what happens with server arrays).

I am not saying that those are not good products, and I would even say that you probably have to solve things in the manner they chose to solve them, but those "fine print" items are things that typical users won't know, and they'll assume that those products are "magical" almost as much as the iPad.

-shay.banon


Re: CAP theorem

Sergio Bossa
On Thu, Jun 17, 2010 at 2:37 PM, Shay Banon
<[hidden email]> wrote:

> Maan, discussion for this should be done over a beer and not over emails,
> its very hard to convey all the different aspects and trying to write short
> answers just missed the point, but I will try and write something... .

I know, I know, but "a few" miles separate us, so no beers and chats,
and the only things remaining are emails ;)

You gave a very satisfying and informative answer, and I think it's
actually very important for ES end users: they may think ES is
magical (like the iPad, as you said), but it isn't.
It currently sacrifices consistency in case of partitions in the terms
you explained, and you are planning to implement a bunch of cool
features to make ES suit different needs: those are all important bits
of information, and I think users will be grateful for them. At least,
I am ;)

Thanks,
Cheers!

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: CAP theorem

kimchy
Administrator
Well, next time I am in Rome (well, I've never been, so it will be my first time there ;) )... . Happy the answers make sense. Btw, you did not answer the Terracotta questions; this is something I always wanted to know about but could not find anything in the docs... .

-shay.banon


Re: CAP theorem

Sergio Bossa
On Sat, Jun 19, 2010 at 11:55 PM, Shay Banon
<[hidden email]> wrote:

> Well, next time I am in Rome (well, never been, so first time I will be
> there ;) )... .

Anytime ;)

>Happy the answers make sense, btw, you did not answer
> regarding the terracotta ones, this is something that I always wanted to
> know about but could not find anything in the docs... .

Sorry, missed your questions :)
Anyway, Terracotta should work as follows:
1) In case of client or server failure, using async writes, there's no
data loss provided you run an active/passive pair: the passive one
will take the transaction over and complete it as the new active.
2) In case of active/passive server partitioning, the currently active
one will keep its clients connected, while the passive one will
elect itself as master, but with no attached clients, so there will
be no split brain; once the partition heals, the server which
kept the attached clients will zap the other one and downshift it to
the passive state.
In the end, you could have a split brain only if you had one server
and a bunch of clients on one switch, and another server and another
bunch of clients on another switch, and the switches got partitioned
... a pretty bizarre network configuration, provided you're not
running in the cloud ... so, Terracotta also has its own split brain
vulnerabilities, which are IMHO less common than the ES ones, but
Terracotta is master based so it's easy to manage coordination, while
ES is decentralized and yadda-yadda-yadda ... you get the idea :)

Hope that answers your questions ... feel free to ask more, obviously ;)
Cheers!

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Re: CAP theorem

kimchy
Administrator
Interesting. First of all, in most cases you have both the es clients ("native clients") and the nodes within the same network switch, so the chances of a split brain with clients are identical to Terracotta's.

So, now I understand: TC has sync replication between the active and passive servers, and writes to disk (in an async manner, I presume).

Yet another question in the TC case then. Let's assume you have two servers, active and passive, and both, I assume, write their state to the local disk. Now, you bring down the active server, the passive becomes active (master), and clients start writing to the new active server.

Now, I bring down the last server, which is active, and afterwards start the first server, which, I assume, becomes active and starts to receive client requests. Now, that server has an old view of the data, since clients have been performing changes against the other server while it was down. How does TC recover from that?

-shay.banon
