|
Hello, I have quite a lot of data in Oracle and I need to keep all its data indexed in ES. I'm having trouble finding a good solution, so I hope you guys can give me some feedback.Thanks and best regards, Radu -- http://sematext.com/ -- ElasticSearch -- Solr -- Lucene You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
Hi Radu,
a little bit sparsely documented, there is an "acknowledge statement" mechanism implemented since 2.0.0 in the JDBC river. The idea is that two SQL operations are executed per cycle: - one that selects rows from the RDBMS and constructs JSON objects - a second "acknowledge statement" that delete or update those rows This "handshake" method must use a "job ID" column in the RDBMS data (which is nothing but a version number in most cases) The two statements are executed on two JDBC connections, a read and a write connection, to ease the transaction management on the RDBMS side. There are almost no limits for the "job ID" column (but, some DBs, not Oracle, may not easily move large result sets due to some JDBC driver dependent oddities). The only task is to set up a suitable time interval with the "poll" parameter in the river und a mechanism on DB side to generate the "job IDs" in sequence and to create the data to be pushed. The JDBC river relies on the database connectivity given by JDBC. You are right about MonogDB log-based approach, this is most attractive with the ES river philosophy. With Oracle Streams, it should be possible to use JMS to connect to Oracle in a stream-like fashion. But, there is no JMS river yet: https://github.com/elasticsearch/elasticsearch/issues/1109 Best regards, Jörg Am 14.02.13 14:58, schrieb Radu Gheorghe: > Hello, > > I have quite a lot of data in Oracle and I need to keep all its data > indexed in ES. I'm having trouble finding a good solution, so I hope > you guys can give me some feedback. > > My use-case goes like this: > - there are lots of rows (hundreds of million) in quite a lot of > tables (10-20) > - the resulting documents are rather large, because they contain data > from all those tables. What's more, they're structured and there are > some arrays. Like, the document is the user, and the user can belong > to multiple groups > - so far the best fit seems to be Jörg's JDBC river[0], but I'm a bit > worried about managing updates. The "simple" strategy seems to need to > look at all data from Oracle, and I would like to avoid that. And I'm > a bit unclear about the "table" strategy - it seems like I would need > to manage the timestamp and versioning from Oracle? > > Ideally, what I'd want to have is something like the MongoDB river > does with the operations log. If that's not possible, what would you > recommend for this use-case? > > [0] https://github.com/jprante/elasticsearch-river-jdbc > > Thanks and best regards, > Radu > -- > http://sematext.com/ -- ElasticSearch -- Solr -- Lucene > -- > You received this message because you are subscribed to the Google > Groups "elasticsearch" group. > To unsubscribe from this group and stop receiving emails from it, send > an email to [hidden email]. > For more options, visit https://groups.google.com/groups/opt_out. > > -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
Hi Jörg, Thanks a lot for replying! I'll ask some more inline :)On Thu, Feb 14, 2013 at 5:06 PM, Jörg Prante <[hidden email]> wrote: Hi Radu, I'm not sure I followed. So I guess I'll have to make a new column for every table named jobID or something like that, which will be updated with every operation. That seems clear. Then, how do I create the data that needs to be pushed? Do the joins and put that data in a different table, so the JDBC river can pull it using the "table" strategy? If yes, then does the JDBC river automatically remove processed records from that table? Or should I do that manually? If none of the above should be done, I guess that table will pretty much duplicate the data from the DB. Or am I missing something?
Thanks for the pointer. At this point I see two "paths": - build ES documents from the DB everytime, and try to filter by version/timestamp so I won't have to build *all* documents. Then re-index them - keep track of all changes for every table. Then for every change use the Update API to apply each change to the corresponding ES document(s). In my use-case I can easily track the ES ID where each row should go, so that might work. The downside of this is having to apply updates one by one (no bulk) Is there another option? Best regards, Radu
-- http://sematext.com/ -- ElasticSearch -- Solr -- Lucene You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
Hi Radu,
yes, there are different tasks, when I think in "ETL-style": baseline loading (what you mention about the millions of rows) and incremental updates (inserts/updates/deletes). Baseline loading is straight forward (the "oneshot" JDB river strategy). For the incremental updates, there are options: - the DB "as is" must be examined for inserts/updates/deletes (like a "diff" operates on text lin files) - volatile data is created at DB side to get pushed 1:1 to ES (stream-like, the "table" strategy tries to do this, but is not well tested) At ES side, there are challenges: - stale data: data not in DB, but in ES - correct order of insert/update/delete (ES is not transactional) - synchronization (versioning), e.g. if the river was disabled for a while, how can a river missing data detect and how can a river reset to the correct point in time The JDBC river offers some rudimentary approaches: - it can use the versioning feature in ES for the river poll cycles - it can compute checksums over fetched data to detect differences between river poll cycles, and a housekeeper thread to delete documents with older versions What is missing: - a strategy for rolling indices (timestamp-based polls could be directed to aliased indexes) I think there is no one-size-fits-all solution, but I'd be happy about suggestions that can be added to the JDBC river, and maybe also useful for other rivers as well! Best regards, Jörg Am 15.02.13 10:54, schrieb Radu Gheorghe: > Thanks for the pointer. At this point I see two "paths": > - build ES documents from the DB everytime, and try to filter by > version/timestamp so I won't have to build *all* documents. Then > re-index them > - keep track of all changes for every table. Then for every change use > the Update API to apply each change to the corresponding ES > document(s). In my use-case I can easily track the ES ID where each > row should go, so that might work. The downside of this is having to > apply updates one by one (no bulk) > > Is there another option? > > Best regards, > Radu -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
Hi Jörg,
-- On Fri, Feb 15, 2013 at 1:22 PM, Jörg Prante <[hidden email]> wrote: Hi Radu,
Right. That's my thinking as well.
Right. I got that covered.
OK. The first option won't work for me because of size, but I'll look some more at the "table" strategy. But doesn't this strategy imply that all the documents that should get in ES (hundreds of millions in my case) have to be in that table? Or does the JDBC river evict processed items?
Right. I'm aware of that.
Right. That's the key takeaway for me. I'll come up with suggestions and/or pull requests to your river if I get any good idea that would apply. Thanks a lot for your input! Best regards, Radu
-- http://sematext.com/ -- ElasticSearch -- Solr -- Lucene You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
No, you can first initiate a "oneshot" river. Afterwards, you can delete
this river (it runs only once) and you can use another river instance, maybe a table-based one. It needs some testing for the JSON objects so they match in the structure, but basically it should work. Jörg Am 15.02.13 12:51, schrieb Radu Gheorghe: > OK. The first option won't work for me because of size, but I'll look > some more at the "table" strategy. But doesn't this strategy imply > that all the documents that should get in ES (hundreds of millions in > my case) have to be in that table? Or does the JDBC river evict > processed items? -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
On Fri, Feb 15, 2013 at 2:25 PM, Jörg Prante <[hidden email]> wrote: No, you can first initiate a "oneshot" river. Got that. It would be the way to go for me. Afterwards, you can delete this river (it runs only once) and you can use another river instance, maybe a table-based one. It needs some testing for the JSON objects so they match in the structure, but basically it should work. Yes, but I'll have to put the "ES docs" in that table as they get updated in Oracle. Given that the river doesn't delete processed documents (or at least I missed the part in the code where it does it), it should mean that in quite a short time, that table will get huge. I was thinking about extending the river with a new parameter to remove processed documents from the source table. But if I do that it's going to be difficult to manage deletes (like you said, with stale data). Maybe a way to tackle that is by having an "is_deleted" column so the river knows to remove all docs where is_deleted=1. Does this sound like an option to you? Best regards, Radu
-- http://sematext.com/ -- ElasticSearch -- Solr -- Lucene You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
Yes, that's what I currently try to do with the acknowledging SQL
statement in the "simple" river strategy. Example of such a handshake SQL statement pair (to sketch the idea how it is supposed to work) 1. select * from table where job_id = <jobid> 2. delete from table where job_id = <jobid> or 1. select * from table where job_id = <jobid> and received = 'N' 2. update table set received = 'Y' where job_id = <jobid> Jörg Am 15.02.13 13:44, schrieb Radu Gheorghe: > I was thinking about extending the river with a new parameter to > remove processed documents from the source table. But if I do that > it's going to be difficult to manage deletes (like you said, with > stale data). Maybe a way to tackle that is by having an "is_deleted" > column so the river knows to remove all docs where is_deleted=1. -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
|
Thanks a lot, Jörg! I'll try the Update API approach first (see a couple of Emails earlier), but I'm not sure if this will be enough. So I'll probably come back to the JDBC river soon :)Best regards, Radu On Fri, Feb 15, 2013 at 2:52 PM, Jörg Prante <[hidden email]> wrote: Yes, that's what I currently try to do with the acknowledging SQL statement in the "simple" river strategy. -- http://sematext.com/ -- ElasticSearch -- Solr -- Lucene You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/groups/opt_out. |
| Powered by Nabble | Edit this page |
