[hadoop] newbie question

[hadoop] newbie question

liorg2
Hi,

I have a few basic questions about es-hadoop, and I would really appreciate your help.

1. If I already have an ES cluster, is there any reason to add a Hadoop layer?

2. Is the idea of es-hadoop that Hadoop is the data store and ES is the search engine on top of it?

3. Can Logstash write to Hadoop?

4. When I run queries against ES, does it go to HDFS in real time?

Thanks a lot!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cf3e6331-2ac2-4967-89b9-a60f2967b8de%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [hadoop] newbie question

Mark Walkom-2
  1. Maybe; it depends on your use case.
  2. No. They connect, but ES does not store its data on HDFS.
  3. Not natively; see e.g. http://www.devopsa.net/2014/04/three-way-to-use-logstash-with-hadoop.html
  4. Can you elaborate on what you mean here (though see 2)?

On 3 May 2015 at 19:21, Lior Goldemberg <[hidden email]> wrote:

Re: [hadoop] newbie question

Costin Leau
To add to Mark's answer:

1. Hadoop means a lot of things, so typically, if you are not familiar
with it or not already a user, the answer tends to be no.
2. No. Data is indexed from Hadoop into Elasticsearch or vice versa. See
elastic.co/hadoop and the various presentations on this topic. Again,
es-hadoop is meant for Hadoop users who want to leverage Elasticsearch.
3. Mark already replied.
4. No, see 2. Note that HDFS is an archival store, so accessing it in
"real time" means slow access, especially for random access.

On Sun, May 3, 2015 at 1:01 PM, Mark Walkom <[hidden email]> wrote:


Re: [hadoop] newbie question

liorg2
Thanks, guys.

I have an app that needs to write big data; it currently writes directly to ES.

I also have really heavy aggregations (scripted metric), which take a long time (a few minutes).

Since I know that ES is supposed to be a search engine and not a DB (by their own claim), I started to look for solutions, and I thought of the following setup:

 1. Write to HDFS.
 2. Index the data after some manipulations, which will save me the expensive aggregations.
 3. Run queries on ES.
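
As a rough illustration of step 2, the "manipulation" could be a roll-up that collapses raw events into one summary document per user and day, so that queries read precomputed counts instead of aggregating raw events on the fly. This is only a sketch: the function name `rollup_events`, the summary layout, and the day/month/year date format are assumptions made for the example, not part of es-hadoop or Elasticsearch.

```python
from collections import Counter, defaultdict
from datetime import datetime


def rollup_events(events):
    """Collapse raw events into one summary per (userid, day).

    Each summary holds the number of events per type, so a query can
    read precomputed numbers instead of aggregating raw events.
    """
    counts = defaultdict(Counter)
    for event in events:
        # Assumed format: day/month/year followed by a 24-hour time.
        day = datetime.strptime(event["date"], "%d/%m/%Y %H:%M:%S").date()
        counts[(event["userid"], day.isoformat())][event["event_type"]] += 1
    return [
        {"userid": userid, "day": day, "event_counts": dict(counts[(userid, day)])}
        for (userid, day) in sorted(counts)
    ]


raw = [
    {"userid": 1, "event_type": "A", "date": "03/05/2015 14:25:01"},
    {"userid": 1, "event_type": "A", "date": "03/05/2015 14:25:02"},
    {"userid": 1, "event_type": "S", "date": "03/05/2015 14:25:03"},
    {"userid": 2, "event_type": "A", "date": "04/05/2015 09:00:00"},
]
summaries = rollup_events(raw)
```

The summaries, rather than the raw events, would then be the documents that get indexed.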
 
Does it make sense?
Do you know a better, more scalable option?

Lior


Re: [hadoop] newbie question

Costin Leau
Yes, that works. It looks like you are only using HDFS and none of the
computational components of Hadoop (Map/Reduce, Hive, Spark, etc.),
so you could import the data from HDFS into Elasticsearch with or
without Hadoop.
With Hadoop you get parallelism, but you need to learn (if you haven't
already) one of the compute frameworks out there (there are plenty of
options). Without it you potentially get an easier solution that
doesn't parallelize the input (which tends to be a problem only beyond
a certain size).
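
For the route without Hadoop, a single-process importer may be enough at modest sizes. Below is a minimal sketch, assuming the data has already been copied out of HDFS as one JSON document per line and will be sent through Elasticsearch's bulk endpoint, whose request body alternates an action line with a document line. The helper name `bulk_bodies`, the index name and the batch size are placeholders; the sketch only builds the request bodies and leaves the HTTP call out.

```python
import json


def bulk_bodies(lines, index, batch_size=1000):
    """Yield newline-delimited bulk request bodies, batch_size docs each."""
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        doc = json.loads(line)  # fails early on a malformed line
        batch.append(json.dumps({"index": {"_index": index}}))  # action line
        batch.append(json.dumps(doc))  # document line
        if len(batch) >= 2 * batch_size:
            yield "\n".join(batch) + "\n"  # the body must end with a newline
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"


export = [
    '{"userid": 1, "event_type": "A"}',
    '{"userid": 1, "event_type": "T"}',
    '{"userid": 2, "event_type": "A"}',
]
bodies = list(bulk_bodies(export, index="events", batch_size=2))
```

Each body would be POSTed to the `_bulk` endpoint one after another, which is exactly the non-parallel input mentioned above. Older Elasticsearch versions also expect a `_type` in the action line.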

Which option fits your scenario really depends on your requirements.

I don't want to steer you away from Hadoop (I'm the lead of the es-hadoop
project), but rather point out that it is not the only solution out there
and that it comes with a cost.

Cheers,

On Sun, May 3, 2015 at 1:46 PM, Lior Goldemberg <[hidden email]> wrote:


Re: [hadoop] newbie question

Mark Walkom-2
In reply to this post by liorg2
ES can do aggregations very quickly; it really depends on what you want to do.

I'd suggest trying ES and seeing if it can do what you want, and go from there.

On 3 May 2015 at 20:46, Lior Goldemberg <[hidden email]> wrote:

Re: [hadoop] newbie question

liorg2
In reply to this post by Costin Leau
Hi guys,

Thanks again for the quick replies, very much appreciated!

We have been using ES for the past year, and from day one we haven't had good performance from the Groovy scripts that use scripted metric aggregations.

Our data is not huge yet: we have 163 indices, 356 shards and 360M documents, but when we run the Groovy script it can take up to 2-3 minutes. From our understanding it should run much faster.

So we are worried about the future, when the data becomes a lot bigger after the beta stage.


Now, I'm not sure what the difference is between Hadoop and HDFS.
Is Hadoop an engine that runs over HDFS?

By the way, my complicated scenario is that I have tons of events, with fields: event type, user id, date, ... [lots more] ...

For example:
{ userid:1, event_type:A,date:03/05/2015 14:25:01} 
{ userid:1, event_type:T,date:03/05/2015 14:25:02} 
{ userid:1, event_type:S,date:03/05/2015 14:25:03} 
{ userid:1, event_type:Z,date:03/05/2015 14:25:04} 
{ userid:1, event_type:B,date:03/05/2015 14:25:05} 

In the query, I need to find specific flows of users, not necessarily consecutive. For example, A->S->Z needs to return the user above, with all the relevant docs.
When I use a scripted metric aggregation, it takes a long time and, moreover, a lot of memory, and it sometimes kills ES.

Can Hadoop help me with it?
I thought of creating a list of events per user (currently I have a type "events" in a daily index, with a list of events ordered by date and time, and the user is a field in this type).

Thanks again!


Re: [hadoop] newbie question

liorg2
come on guys, you were so helpful so far :)

On Sun, May 3, 2015 at 2:32 PM, Lior Goldemberg <[hidden email]> wrote:

Re: [hadoop] newbie question

Christian Dahlqvist
In reply to this post by liorg2
Hi,

I am sure Hadoop can help you calculate this, but you may also be able to go about this more efficiently in Elasticsearch. If you, as you mentioned, were to create a user-centric index in addition to the event-centric one that you already have, you could store a list of all the events belonging to a user there. This would allow you to efficiently identify the users that have all the required events through a simple query, and then process just those users to verify that the order is correct, which is likely to scale and perform much better than the current approach. This is what is usually referred to as entity-centric indexing [1].
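
A minimal sketch of that two-step idea in plain Python, with a dictionary standing in for the user-centric index. The names `has_all_events`, `matches_flow` and `find_users` are made up for illustration, and in a real setup the first step would be an Elasticsearch query rather than a loop.

```python
def has_all_events(user_events, flow):
    """Cheap filter: does the user have every event type in the flow?"""
    return set(flow) <= set(user_events)


def matches_flow(user_events, flow):
    """Order check: do the flow's events occur in this order, gaps allowed?"""
    remaining = iter(user_events)
    # `step in remaining` advances the iterator past the first match, so
    # each step is only searched for after the previous one was found.
    return all(step in remaining for step in flow)


def find_users(user_index, flow):
    """Return the ids of users whose time-ordered events contain the flow."""
    candidates = (
        uid for uid, events in user_index.items() if has_all_events(events, flow)
    )
    return [uid for uid in candidates if matches_flow(user_index[uid], flow)]


# One entry per user: event types ordered by time.
user_index = {
    1: ["A", "T", "S", "Z", "B"],  # same sequence as the example events in this thread
    2: ["Z", "S", "A"],            # has all three events, but in the wrong order
    3: ["A", "S"],                 # missing Z
}
matching = find_users(user_index, ["A", "S", "Z"])
```

Only users that pass the cheap filter reach the order check, which is where the saving over scanning every event comes from.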

As updating the user-centric index for every event inserted can often be expensive, a common approach is to create a batch job that periodically retrieves all new events, aggregates them per user and updates the user index. This means that the user index will not be completely up to date all the time, but as it spreads out the processing work, it can make queries much more efficient.
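
Such a batch job could look roughly like the sketch below, again with in-memory structures in place of the two indices. The checkpoint handling and the name `update_user_index` are assumptions for illustration; a real job would read the new events from the event index and write the merged lists back to the user index.

```python
from collections import defaultdict


def update_user_index(user_index, events, last_seen):
    """Append events newer than last_seen to each user's event list.

    Returns the new checkpoint, which the next run passes as last_seen.
    Timestamps are assumed to be sortable values such as ISO 8601 strings.
    """
    new_by_user = defaultdict(list)
    checkpoint = last_seen
    for event in sorted(events, key=lambda e: e["date"]):
        if event["date"] <= last_seen:
            continue  # already handled by an earlier run
        new_by_user[event["userid"]].append(event["event_type"])
        checkpoint = event["date"]
    for userid, new_types in new_by_user.items():
        # One write per user per run instead of one write per event.
        user_index.setdefault(userid, []).extend(new_types)
    return checkpoint


user_index = {1: ["A", "T"]}
events = [
    {"userid": 1, "event_type": "T", "date": "2015-05-03T14:25:02"},
    {"userid": 1, "event_type": "Z", "date": "2015-05-03T14:25:04"},
    {"userid": 1, "event_type": "S", "date": "2015-05-03T14:25:03"},
    {"userid": 2, "event_type": "A", "date": "2015-05-03T14:25:05"},
]
checkpoint = update_user_index(user_index, events, last_seen="2015-05-03T14:25:02")
```

A plain timestamp checkpoint can miss a late-arriving event that carries the same timestamp as the checkpoint, so a real job would need a finer-grained marker or a small overlap window.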

[1] https://www.elastic.co/videos/entity-centric-indexing-london-meetup-sep-2014

Best regards,

Christian


On Sunday, 3 May 2015 10:21:35 UTC+1, Lior Goldemberg wrote:

Re: [hadoop] newbie question

liorg2
Thanks a lot,
much appreciated.

On Sunday, May 3, 2015, Christian Dahlqvist <[hidden email]> wrote: