
Options for synchronizing SQL (Oracle) with Elasticsearch


Radu Gheorghe-2
Hello,

I have quite a lot of data in Oracle and I need to keep all of it indexed in ES. I'm having trouble finding a good solution, so I hope you guys can give me some feedback.

My use-case goes like this:
- there are lots of rows (hundreds of millions) in quite a lot of tables (10-20)
- the resulting documents are rather large, because they contain data from all those tables. What's more, they're structured and contain some arrays. For example, the document is a user, and the user can belong to multiple groups
- so far the best fit seems to be Jörg's JDBC river[0], but I'm a bit worried about managing updates. The "simple" strategy seems to need to look at all the data from Oracle, which I would like to avoid. And I'm a bit unclear about the "table" strategy - it seems like I would need to manage the timestamps and versioning from Oracle?

Ideally, what I'd want to have is something like the MongoDB river does with the operations log. If that's not possible, what would you recommend for this use-case?

[0] https://github.com/jprante/elasticsearch-river-jdbc
Thanks and best regards,
Radu
--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

joergprante@gmail.com
Hi Radu,

Although it is a little sparsely documented, there has been an
"acknowledge statement" mechanism implemented in the JDBC river since
version 2.0.0.

The idea is that two SQL operations are executed per cycle:

- one that selects rows from the RDBMS and constructs JSON objects
- a second "acknowledge statement" that deletes or updates those rows

This "handshake" method must use a "job ID" column in the RDBMS data
(which is nothing but a version number in most cases)

The two statements are executed on two JDBC connections, a read and a
write connection, to ease the transaction management on the RDBMS side.

There are almost no limits for the "job ID" column (though some DBs, not
Oracle, may not easily move large result sets due to some
JDBC-driver-dependent oddities). The only task is to set up a suitable
time interval with the "poll" parameter in the river and a mechanism on
the DB side to generate the "job IDs" in sequence and to create the data
to be pushed.
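
For illustration, the DB-side mechanism could be as simple as a sequence
plus a staging table that the river polls. A minimal sketch only: all
object names here are assumptions for the example, not prescribed by the
river:

  -- a sequence to generate job IDs in order
  CREATE SEQUENCE es_job_seq;

  -- a staging table holding the data to be pushed
  CREATE TABLE es_feed (
    job_id  NUMBER       NOT NULL,  -- stamped per change or per load cycle
    doc_id  VARCHAR2(64) NOT NULL,  -- the target Elasticsearch document ID
    payload CLOB                    -- the data that becomes the JSON document
  );

The river's two statements would then select rows by job ID and
acknowledge (delete or mark) the same rows.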

The JDBC river relies on the database connectivity given by JDBC. You
are right about the MongoDB log-based approach; it is the most
attractive fit for the ES river philosophy. With Oracle Streams, it
should be possible to use JMS to connect to Oracle in a stream-like
fashion. But there is no JMS river yet:
https://github.com/elasticsearch/elasticsearch/issues/1109

Best regards,

Jörg


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

Radu Gheorghe-2
Hi Jörg,

Thanks a lot for replying! I'll ask some more questions inline :)

On Thu, Feb 14, 2013 at 5:06 PM, Jörg Prante <[hidden email]> wrote:
> Hi Radu,
>
> Although it is a little sparsely documented, there has been an "acknowledge statement" mechanism implemented in the JDBC river since version 2.0.0.
>
> The idea is that two SQL operations are executed per cycle:
>
> - one that selects rows from the RDBMS and constructs JSON objects
> - a second "acknowledge statement" that deletes or updates those rows
>
> This "handshake" method must use a "job ID" column in the RDBMS data (which is nothing but a version number in most cases).
>
> The two statements are executed on two JDBC connections, a read and a write connection, to ease the transaction management on the RDBMS side.
>
> There are almost no limits for the "job ID" column (though some DBs, not Oracle, may not easily move large result sets due to some JDBC-driver-dependent oddities). The only task is to set up a suitable time interval with the "poll" parameter in the river and a mechanism on the DB side to generate the "job IDs" in sequence and to create the data to be pushed.

I'm not sure I followed. So I guess I'll have to make a new column named jobID (or something like that) for every table, which will be updated with every operation. That seems clear.

Then, how do I create the data that needs to be pushed? Do the joins and put that data in a different table, so the JDBC river can pull it using the "table" strategy?

If yes, then does the JDBC river automatically remove processed records from that table? Or should I do that manually?

If none of the above should be done, I guess that table will pretty much duplicate the data from the DB. Or am I missing something?

> The JDBC river relies on the database connectivity given by JDBC. You are right about the MongoDB log-based approach; it is the most attractive fit for the ES river philosophy. With Oracle Streams, it should be possible to use JMS to connect to Oracle in a stream-like fashion. But there is no JMS river yet: https://github.com/elasticsearch/elasticsearch/issues/1109

Thanks for the pointer. At this point I see two "paths":
- build ES documents from the DB every time, and try to filter by version/timestamp so I won't have to build *all* documents. Then re-index them (rough sketch below)
- keep track of all changes for every table. Then, for every change, use the Update API to apply it to the corresponding ES document(s). In my use-case I can easily track the ES ID where each row should go, so that might work. The downside of this is having to apply updates one by one (no bulk)
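
As a rough sketch of the first path (all table and column names here are
assumptions, and each table is assumed to carry a last_modified column):

  -- rebuild only the user documents whose own row or related rows changed
  SELECT u.id, u.name, g.name AS group_name
  FROM users u
  LEFT JOIN user_groups ug ON ug.user_id = u.id
  LEFT JOIN groups g ON g.id = ug.group_id
  WHERE u.last_modified > :last_sync
     OR ug.last_modified > :last_sync
     OR g.last_modified > :last_sync;

The subtle part is the OR conditions: a change in a join table alone must
still re-select the affected user documents, otherwise changes to group
membership would never reach ES.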

Is there another option?

Best regards,
Radu
 

--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

joergprante@gmail.com
Hi Radu,

Yes, there are different tasks when I think in "ETL style": baseline
loading (what you mention about the millions of rows) and incremental
updates (inserts/updates/deletes).

Baseline loading is straightforward (the "oneshot" JDBC river strategy).

For the incremental updates, there are options:

- the DB "as is" must be examined for inserts/updates/deletes (like a
"diff" operating on text files)
- volatile data is created on the DB side to be pushed 1:1 to ES
(stream-like; the "table" strategy tries to do this, but it is not well
tested)

On the ES side, there are challenges:

- stale data: data not in DB, but in ES
- correct order of insert/update/delete (ES is not transactional)
- synchronization (versioning), e.g. if the river was disabled for a
while, how can it detect missing data and how can it reset to the
correct point in time?

The JDBC river offers some rudimentary approaches:

- it can use the versioning feature in ES for the river poll cycles
- it can compute checksums over fetched data to detect differences
between river poll cycles, and it runs a housekeeper thread to delete
documents with older versions

What is missing:

- a strategy for rolling indices (timestamp-based polls could be
directed to aliased indices)

I think there is no one-size-fits-all solution, but I'd be happy about
suggestions that could be added to the JDBC river and might also be
useful for other rivers!

Best regards,

Jörg


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

Radu Gheorghe-2
Hi Jörg,

On Fri, Feb 15, 2013 at 1:22 PM, Jörg Prante <[hidden email]> wrote:
> Hi Radu,
>
> Yes, there are different tasks when I think in "ETL style": baseline loading (what you mention about the millions of rows) and incremental updates (inserts/updates/deletes).

Right. That's my thinking as well.
 

> Baseline loading is straightforward (the "oneshot" JDBC river strategy).

Right. I got that covered.
 

> For the incremental updates, there are options:
>
> - the DB "as is" must be examined for inserts/updates/deletes (like a "diff" operating on text files)
> - volatile data is created on the DB side to be pushed 1:1 to ES (stream-like; the "table" strategy tries to do this, but it is not well tested)

OK. The first option won't work for me because of size, but I'll look some more at the "table" strategy. But doesn't this strategy imply that all the documents that should get in ES (hundreds of millions in my case) have to be in that table? Or does the JDBC river evict processed items?
 

> On the ES side, there are challenges:
>
> - stale data: data not in DB, but in ES
> - correct order of insert/update/delete (ES is not transactional)
> - synchronization (versioning), e.g. if the river was disabled for a while, how can it detect missing data and how can it reset to the correct point in time?

Right. I'm aware of that.
 

> The JDBC river offers some rudimentary approaches:
>
> - it can use the versioning feature in ES for the river poll cycles
> - it can compute checksums over fetched data to detect differences between river poll cycles, and it runs a housekeeper thread to delete documents with older versions
>
> What is missing:
>
> - a strategy for rolling indices (timestamp-based polls could be directed to aliased indices)
>
> I think there is no one-size-fits-all solution, but I'd be happy about suggestions that could be added to the JDBC river and might also be useful for other rivers!

Right. That's the key takeaway for me. I'll come up with suggestions and/or pull requests for your river if I get any good ideas that apply.

Thanks a lot for your input!

Best regards,
Radu
 

--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

joergprante@gmail.com
No, you can first initiate a "oneshot" river. Afterwards, you can
delete this river (it runs only once) and use another river instance,
maybe a table-based one. It needs some testing so the JSON objects match
in structure, but basically it should work.

Jörg

On 15.02.13 at 12:51, Radu Gheorghe wrote:
> OK. The first option won't work for me because of size, but I'll look
> some more at the "table" strategy. But doesn't this strategy imply
> that all the documents that should get in ES (hundreds of millions in
> my case) have to be in that table? Or does the JDBC river evict
> processed items?


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

Radu Gheorghe-2
On Fri, Feb 15, 2013 at 2:25 PM, Jörg Prante <[hidden email]> wrote:
> No, you can first initiate a "oneshot" river.

Got that. It would be the way to go for me.
 
> Afterwards, you can delete this river (it runs only once) and use another river instance, maybe a table-based one. It needs some testing so the JSON objects match in structure, but basically it should work.

Yes, but I'll have to put the "ES docs" in that table as they get updated in Oracle. Given that the river doesn't delete processed documents (or at least I missed the part of the code where it does), that table will get huge quite quickly.

I was thinking about extending the river with a new parameter to remove processed documents from the source table. But if I do that it's going to be difficult to manage deletes (like you said, with stale data). Maybe a way to tackle that is by having an "is_deleted" column so the river knows to remove all docs where is_deleted=1 (see the sketch below).
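
To make the idea concrete, a minimal sketch of the statement pair this would need, reusing the hypothetical es_feed staging table from earlier in the thread, extended with the proposed is_deleted flag (none of this is an existing river parameter):

  -- rows with is_deleted = 1 tell the river to delete the doc from ES
  SELECT doc_id, is_deleted, payload FROM es_feed WHERE job_id = :job_id;
  -- acknowledge: evict the processed rows so the table stays small
  DELETE FROM es_feed WHERE job_id = :job_id;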

Does this sound like an option to you?

Best regards,
Radu
 


--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

joergprante@gmail.com
Yes, that's what I'm currently trying to do with the acknowledging SQL
statement in the "simple" river strategy.

Example of such a handshake SQL statement pair (to sketch the idea of
how it is supposed to work):

1. select * from table where job_id = <jobid>
2. delete from table where job_id = <jobid>

or

1. select * from table where job_id = <jobid> and received = 'N'
2. update table set received = 'Y' where job_id = <jobid>
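
For completeness, one way such a table could be kept fed on the Oracle
side is a row-level trigger. A sketch only: the trigger, sequence and
table names are invented here (tying together the es_feed/is_deleted
ideas from earlier in the thread) and are not part of the river:

  CREATE OR REPLACE TRIGGER trg_users_es_feed
  AFTER INSERT OR UPDATE OR DELETE ON users
  FOR EACH ROW
  BEGIN
    -- one job ID per change; deletes are flagged instead of dropped
    INSERT INTO es_feed (job_id, doc_id, is_deleted)
    VALUES (es_job_seq.NEXTVAL,
            NVL(:NEW.id, :OLD.id),
            CASE WHEN DELETING THEN 1 ELSE 0 END);
  END;
  /

With per-row job IDs like this, the select/acknowledge pair would work on
a range (e.g. job_id <= the highest ID fetched) rather than a single value.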

Jörg


Re: Options for synchronizing SQL (Oracle) with Elasticsearch

Radu Gheorghe-2
Thanks a lot, Jörg!

I'll try the Update API approach first (see a couple of emails earlier), but I'm not sure it will be enough, so I'll probably come back to the JDBC river soon :)

Best regards,
Radu

--
http://sematext.com/ -- ElasticSearch -- Solr -- Lucene
