Trying to find the reason for a performance hiccup

Jae
Hi

I am running 36 instances of elasticsearch-0.19.9. Around 11 AM, one instance was terminated by an AWS system check and was replaced immediately. Since then, I have seen four performance hiccups; please look at the attached graph. There is a bounded queue between the logger and the Elasticsearch client: if the Elasticsearch client gets slower responses from the servers, the queue fills up. The red line shows dropped messages, and the green line plummeting means the server was not handling any traffic.

How can I trace this? Also, what should I do to prevent these performance hiccups?

I tried to find out what happened after 11 AM, but I couldn't find anything except connection timeouts to the dead instance. On the client side there were no failures, just performance degradation. The following is my logging.yml.

Thank you
Best, Jae

-----------------------------------------------
rootLogger: INFO, console, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG
  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN
  # gateway
  gateway: DEBUG
  #index.gateway: DEBUG

  # peer shard recovery
  #indices.recovery: DEBUG

  # discovery
  discovery: TRACE
  index.search.slowlog: TRACE, index_search_slow_log_file

  org.apache: WARN

additivity:
  index.search.slowlog: false

appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

  index_search_slow_log_file:
    type: dailyRollingFile
    file: ${path.logs}/${cluster.name}_index_search_slowlog.log
    datePattern: "'.'yyyy-MM-dd"
    layout:
      type: pattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"

--

eshiccup.tiff (73K)

Re: Trying to find the reason for a performance hiccup

Otis Gospodnetic
Hi,

What about other ES, JVM, or OS metrics? They may reveal the source. Maybe shards were being moved around the cluster during those four hiccups? You can get this information from ES. We graph it in SPM and have found it very informative when troubleshooting performance issues.
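
If you don't have a monitoring tool in place, the cluster health and node stats APIs will show whether shards were initializing or relocating at those times; a rough sketch, assuming you can curl a local node on port 9200:

# was the cluster recovering or rebalancing at the time?
# watch relocating_shards, initializing_shards and unassigned_shards
curl -s 'http://localhost:9200/_cluster/health?pretty=true'

# per-node stats (heap, GC, caches); on 0.19 some sections may need
# extra flags such as ?jvm=true&os=true&indices=true
curl -s 'http://localhost:9200/_nodes/stats?pretty=true'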

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html



Re: Trying to find the reason for a performance hiccup

ppearcy
If it coincided with a node leaving and rejoining the cluster, it was most likely due to the cold shards that came online. If you're doing sorts or facets, they pull the entire dataset for that field into memory. I believe this feature will mitigate it:
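
In the meantime, one manual workaround on 0.19 is to warm the field data yourself: once the replacement node has recovered its shards, run a representative sorted or faceted query against them before they take real traffic. A rough sketch, where the index name, field name, and host are placeholders:

# load the field data cache for the sort field on the recovered shards
# ("logs" and "timestamp" are placeholders for your own index and field)
curl -s -XGET 'http://localhost:9200/logs/_search?pretty=true' -d '{
  "size": 1,
  "query": { "match_all": {} },
  "sort": [ { "timestamp": "desc" } ]
}'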

Best Regards,
Paul
