Best practices for dealing with a large number of small activity stream events
I'm looking at using Elasticsearch for a use case and would love some feedback on best practices.
A little background... I've been digging into various approaches to interactive drill-down / slice-and-dice analysis of activity stream data (actor / verb / target) for real-time analytics for end users. This data is high-dimensional, with too many potential views to effectively precompute rollups. Other systems that tackle a similar problem and that I've played around with include Druid, OpenTSDB, Blueflood, and InfluxDB. At the end of the day they all either use an inverted index or have (or are planning) Elasticsearch integrations, so I figure why not stick with ES.
There are three areas I am trying to optimize:
- Minimize the index footprint on disk.
- Minimize the RAM footprint.
- Maximize query speed.
I believe the key tradeoff for my dataset is going to be doc_values: whether I keep fielddata on the JVM heap or switch to doc_values and rely on the page cache.
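To illustrate what I mean, here's a minimal sketch of a per-field mapping that enables doc_values on a "not_analyzed" string so its fielddata lives on disk / in the page cache instead of on the heap (field and type names are hypothetical, just for illustration):

```json
{
  "mappings": {
    "event": {
      "_source": { "enabled": false },
      "properties": {
        "actor": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

My understanding is this trades a somewhat larger index on disk for a much smaller heap footprint, with the OS page cache doing the work that fielddata would otherwise do in RAM.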
All of my fields are straight exact-match "not_analyzed" fields, and there are ~15 of them. "not_analyzed" appears to disable the extras that can cause bloat (norms, frequencies, etc.). I am also not storing "_source". Here is my index template: