diff options
author | Yorick Peterse <yorickpeterse@gmail.com> | 2016-01-13 12:57:46 +0100 |
---|---|---|
committer | Yorick Peterse <yorickpeterse@gmail.com> | 2016-01-13 12:57:46 +0100 |
commit | 057eb824b5d7f38547506303dc80da6164715420 (patch) | |
tree | 03c28d94dcd34f80af6c2132284062f19e416e00 | |
parent | 23671600150cb022a09a77b8ea56a9465f19a013 (diff) | |
download | gitlab-ce-057eb824b5d7f38547506303dc80da6164715420.tar.gz |
Randomize metrics sample intervalsconfigure-randomize-metrics-sample-interval
Sampling data at a fixed interval means we can potentially miss data
from events occurring between sampling intervals. For example, say we
sample data every 15 seconds but Unicorn workers get killed after 10
seconds. In this particular case it's possible to miss interesting data
as the sampler will never get to actually submitting data.
To work around this (at least for the most part) the sampling interval
is randomized as following:
1. Take the user specified sampling interval (15 seconds by default)
2. Divide it by 2 (referred to as "half" below)
3. Generate a range (using a step of 0.1) from -"half" to "half"
4. Every time the sampler goes to sleep we'll grab the user provided
interval and add a randomly chosen "adjustment" to it while making
sure we don't pick the same value twice in a row.
For a specified timeout of 15 this means the actual intervals can be
anywhere between 7.5 and 22.5, but never can the same interval be used
twice in a row.
The rationale behind this change is that on dev.gitlab.org I'm sometimes
seeing certain Gitlab::Git/Rugged objects being retained, but only for a
few minutes every 24 hours. Knowing the code of Gitlab and how much
memory it uses/leaks I suspect we're missing data due to workers getting
terminated before the sampler can write its data to InfluxDB.
-rw-r--r-- | lib/gitlab/metrics/sampler.rb | 28 | ||||
-rw-r--r-- | spec/lib/gitlab/metrics/sampler_spec.rb | 22 |
2 files changed, 46 insertions, 4 deletions
diff --git a/lib/gitlab/metrics/sampler.rb b/lib/gitlab/metrics/sampler.rb index c2913841de3..fc709222a9b 100644 --- a/lib/gitlab/metrics/sampler.rb +++ b/lib/gitlab/metrics/sampler.rb @@ -8,8 +8,13 @@ module Gitlab class Sampler # interval - The sampling interval in seconds. def initialize(interval = Metrics.settings[:sample_interval]) - @interval = interval - @metrics = [] + interval_half = interval.to_f / 2 + + @interval = interval + @interval_steps = (-interval_half..interval_half).step(0.1).to_a + @last_step = nil + + @metrics = [] @last_minor_gc = Delta.new(GC.stat[:minor_gc_count]) @last_major_gc = Delta.new(GC.stat[:major_gc_count]) @@ -26,7 +31,7 @@ module Gitlab Thread.current.abort_on_exception = true loop do - sleep(@interval) + sleep(sleep_interval) sample end @@ -102,6 +107,23 @@ module Gitlab def sidekiq? Sidekiq.server? end + + # Returns the sleep interval with a random adjustment. + # + # The random adjustment is put in place to ensure we: + # + # 1. Don't generate samples at the exact same interval every time (thus + # potentially missing anything that happens in between samples). + # 2. Don't sample data at the same interval two times in a row. + def sleep_interval + while step = @interval_steps.sample + if step != @last_step + @last_step = step + + return @interval + @last_step + end + end + end end end end diff --git a/spec/lib/gitlab/metrics/sampler_spec.rb b/spec/lib/gitlab/metrics/sampler_spec.rb index 27211350fbe..38da77adc9f 100644 --- a/spec/lib/gitlab/metrics/sampler_spec.rb +++ b/spec/lib/gitlab/metrics/sampler_spec.rb @@ -9,7 +9,7 @@ describe Gitlab::Metrics::Sampler do describe '#start' do it 'gathers a sample at a given interval' do - expect(sampler).to receive(:sleep).with(5) + expect(sampler).to receive(:sleep).with(a_kind_of(Numeric)) expect(sampler).to receive(:sample) expect(sampler).to receive(:loop).and_yield @@ -116,4 +116,24 @@ describe Gitlab::Metrics::Sampler do sampler.add_metric('cats', value: 10) end end + + describe '#sleep_interval' do + it 'returns a Numeric' do + expect(sampler.sleep_interval).to be_a_kind_of(Numeric) + end + + # Testing random behaviour is very hard, so treat this test as a basic smoke + # test instead of a very accurate behaviour/unit test. + it 'does not return the same interval twice in a row' do + last = nil + + 100.times do + interval = sampler.sleep_interval + + expect(interval).to_not eq(last) + + last = interval + end + end + end end |