Storing and analyzing user agent strings, general approach

Mark Dodwell
I want to store a bunch of documents in elasticsearch (which represent a hit to a website) including the user agent of the client that made the original HTTP request.

Since user agent strings vary a lot, and the useful parts (OS, browser, version, etc.) need to be parsed out, I would like to be able to perform aggregations on those extracted features.

The simplest way I can think of to do this is to parse the user agent string before indexing the document. The downside of this approach is that as new or different user agent strings emerge (which is not unlikely), you have to proactively update the parser.

This may be impossible or undesirable for a number of reasons, but what I'd really like to do is index the raw user agent string and then perform the analysis/feature extraction post hoc at query time. Any ideas or pointers on how to do this?

Aggregators? Custom analyzers? (How would you handle an update to the analyzer, would you need to re-run against all existing stored data?)
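
A minimal sketch of the index-time approach described above, written in Python and not taken from the thread. It keeps the raw string next to the parsed fields and stamps each document with a parser version, so that documents parsed with older rules can be found and re-parsed later. The rule set, the field names (`user_agent_raw`, `ua_parser_version`) and the helper names are all hypothetical.

```python
import re

# Hypothetical version tag for the parsing rules; bump it when the rules change.
PARSER_VERSION = 1

# Deliberately tiny rule set for illustration. Rules are tried in order and
# the first match wins.
BROWSER_RULES = [
    ("Chrome", re.compile(r"Chrome/(\d+)\.(\d+)")),
    ("Firefox", re.compile(r"Firefox/(\d+)\.(\d+)")),
    ("Safari", re.compile(r"Version/(\d+)\.(\d+).*Safari/")),
]


def parse_user_agent(raw):
    """Return a dict of extracted features; unknown agents get name 'Other'."""
    for name, pattern in BROWSER_RULES:
        match = pattern.search(raw)
        if match:
            return {"name": name, "major": match.group(1), "minor": match.group(2)}
    return {"name": "Other"}


def build_document(hit, raw_user_agent):
    """Attach the raw string, the parsed fields and the parser version to a hit."""
    doc = dict(hit)
    doc["user_agent_raw"] = raw_user_agent
    doc["user_agent"] = parse_user_agent(raw_user_agent)
    doc["ua_parser_version"] = PARSER_VERSION
    return doc


def needs_reparse(doc):
    """True when a stored document was parsed with older rules (or not at all)."""
    return doc.get("ua_parser_version", 0) < PARSER_VERSION
```

Because the raw string is always stored, updating the parser only requires re-processing the documents for which `needs_reparse` is true, rather than losing information for good.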

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ed9bf030-f9bf-480a-88b1-a80421b9e79e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Storing and analyzing user agent strings, general approach

Patrick Proniewski
Hi,

You should give http://logstash.net/docs/1.4.2/filters/useragent a try before anything else.

Here is the relevant part of logstash.conf I'm using:

filter {
  # Only process Apache access-log events.
  if [type] == "apache" {
    # Skip empty or placeholder ("-") user agents.
    if [user-agent] != "-" and [user-agent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "user-agent"
      }
    }
    # Drop fields the parser could not identify (reported as "Other").
    if "UA" in [tags] {
      if [device] == "Other" { mutate { remove_field => "device" } }
      if [name]   == "Other" { mutate { remove_field => "name" } }
      if [os]     == "Other" { mutate { remove_field => "os" } }
    }
  }
}

It retains the full user-agent field and adds useful fields such as "device", "major" and "minor" version, "name", "os", "os_name", "os_major" and "os_minor".

sample:
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/6.1.4 Safari/537.76.4",
    "name": "Safari",
    "os": "Mac OS X 10.8.5",
    "os_name": "Mac OS X",
    "os_major": "10",
    "os_minor": "8",
    "major": "6",
    "minor": "1",
    "patch": "4",

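Once fields like these are stored, the aggregations asked about in the original question can be expressed as an Elasticsearch terms aggregation. The following Python sketch only builds the request body; it is not from the thread. The helper name and the `by_` prefix are made up, and it assumes the extracted fields are indexed as exact values (not analyzed), so that "Mac OS X" stays a single bucket.

```python
import json


def terms_aggregation(field, size=10):
    """Build a search body that counts documents per distinct value of `field`."""
    return {
        "size": 0,  # skip the hits, return only the aggregation buckets
        "aggs": {
            "by_" + field: {"terms": {"field": field, "size": size}},
        },
    }


body = terms_aggregation("os_name")
print(json.dumps(body, indent=2))
```

The resulting body could be sent to the index's `_search` endpoint with any HTTP client.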

On 26 June 2014, at 09:09, Mark Dodwell <[hidden email]> wrote:

> I want to store a bunch of documents in elasticsearch (which represent a
> hit to a website) including the user agent of the client that made the
> original HTTP request.

Re: Storing and analyzing user agent strings, general approach

Mark Dodwell
Thanks, lots of useful stuff there. 

On Thu, Jun 26, 2014 at 12:34 AM, Patrick Proniewski <[hidden email]> wrote:

> You should give http://logstash.net/docs/1.4.2/filters/useragent a try before anything else.

Re: Storing and analyzing user agent strings, general approach

Patrick Proniewski
I just realized that the "user-agent" field comes from my Apache config, where I define a JSON logging format:

LogFormat "{ \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \"message\": \"%r\", \"host\": \"%v\", \"user-agent\": \"%{User-agent}i\", \"client\": \"%a\", \"duration_usec\": %D, \"duration_sec\": %T, \"status\": %s, \"size\": %B, \"request_path\": \"%U\", \"request\": \"%U%q\", \"method\": \"%m\", \"referrer\": \"%{Referer}i\" }" logstash_ext_json

Everything else is in my first email.
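
To illustrate what a consumer of this log format sees, here is a small Python sketch (not part of the thread) that reads one JSON log line and pulls out the raw user agent, skipping empty and "-" values the same way the Logstash conditional does. The sample line is hypothetical: its values are made up to match the shape of the LogFormat above.

```python
import json

# Hypothetical access-log line shaped like the LogFormat above; all values are made up.
SAMPLE_LINE = (
    '{ "@timestamp": "2014-06-27T22:15:00+0200", "message": "GET / HTTP/1.1", '
    '"host": "www.example.com", "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) '
    'AppleWebKit/537.76.4 (KHTML, like Gecko) Version/6.1.4 Safari/537.76.4", '
    '"client": "192.0.2.10", "duration_usec": 1234, "duration_sec": 0, "status": 200, '
    '"size": 5120, "request_path": "/", "request": "/", "method": "GET", "referrer": "-" }'
)


def user_agent_of(line):
    """Return the raw user agent from one JSON log line, or None if absent, empty or '-'."""
    event = json.loads(line)
    value = event.get("user-agent", "")
    return value if value not in ("", "-") else None
```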

On 27 June 2014, at 22:03, Mark Dodwell wrote:

> Thanks, lots of useful stuff there.