# Add ADR template and metrics in testing ADR (#152)

Open: wants to merge 4 commits into `main`.

`docs/adr/0000-local-and-test-metrics.md` (new file, 182 additions)

# Reporting metrics in local development and tests

* Status: Proposed
* Deciders: raphael, lina
* Date: 2022-12-20

## Context and Problem Statement

merino-py emits counts, histograms, and timing metrics as it runs.
In production, these metrics are sent to Datadog, using an extension of the
[StatsD](https://github.com/statsd/statsd) protocol called [DogStatsD](https://docs.datadoghq.com/developers/dogstatsd).

**Comment (Member):** My current understanding is that we do not send merino-py metrics to Datadog.
The data flow is as follows:

1. merino-py uses aiodogstatsd to record and submit StatsD metrics
2. the aiodogstatsd client is configured to send metrics to a telegraf agent
3. the telegraf agent uses a StatsD Input and an InfluxDB Output
4. the telegraf agent sends metrics to InfluxDB
5. Grafana queries InfluxDB and visualizes metrics

Does that sound accurate to you, @dlactin?

**Comment (Collaborator):** For context, we use the aiodogstatsd package because it supports the Datadog extensions to the StatsD protocol, which are enabled on telegraf.

For local development, merino-py provides the `metrics.dev_logger` config option,
which logs outgoing metrics to the console instead of sending them to Datadog.

**Comment (Member):** See above.

merino-py also has integration tests that cover recording and emitting metrics.
How can we avoid relying on implementation details of the `aiodogstatsd` client library
to suppress reporting to Datadog?

**Comment (Member):** Suggested change: "to suppress reporting to Datadog?" → "to suppress reporting to InfluxDB?"

## Decision Drivers

1. Avoid future compatibility hazards when upgrading `aiodogstatsd`.
2. Make it easy for engineers to see what metrics are emitted while working on the code.
3. Maintain test coverage for merino-py's metrics.

## Considered Options

* A. Do nothing.
* B. Always send metrics to a live DogStatsD agent.
* C. Change merino-py's `metrics.Client` proxy to not call into `aiodogstatsd`.
* D. Subclass `aiodogstatsd.Client` to suppress sending metrics to DogStatsD.
* E. Extend merino-py's `metrics` module with pluggable backends for production and testing.
* F. Set up a mock DogStatsD endpoint that collects and checks metrics.
* G. Ask the `aiodogstatsd` authors to expose a reporting hook as part of their public API.

## Decision Outcome

TODO: Fill me in once we decide!

**Comment (Contributor):** Have we identified any winner(s) or favorite(s) yet, or will we fill that in at a later time?

**Comment (Contributor):** Another related question: are we using this PR as a discussion where we come to a decision, or is the intent that the discussion and decision happen elsewhere, with this PR purely recording the decision already made? I ask because the way I review such a PR differs depending on whether I'm meant to discuss and sway the decision, or to review its quality as technical documentation.

**Comment:** I second this. It's a little disorienting between the RFC list and an ADR PR. I'm fine either way, but an ADR in my mind is a decision to be recorded, whereas an RFC is an idea to be discussed. Whatever we decide, we should make it consistent across App Svcs.

**Comment (Member, author):**
That's a great question—I'm curious what y'all think! 😄 Reading our RFC process guidelines, I got the impression that it's geared toward significant changes (new features, cross-team projects, design reviews, substantial refactors) seeking feedback from multiple stakeholders, and I wasn't sure that this counted as either.

The line between the RFC list and ADR PRs feels a little fuzzy to me, too—for example, @mhammond (hi Mark! 👋🏼) just put up mozilla/application-services#5302, which is a draft ADR with a chosen solution sent to the RFC list for feedback.

For this one specifically, @hackebrot and I started a discussion on the Jira ticket to talk through some of these options, and we're meeting this week to chat more about them and make a decision. After that, we can update the PR with the outcome. Do we feel like that's a good approach for these kinds of smaller discussions, or would going through the RFC process be valuable—for example, if other folks want to chime in?

**Comment:**
I agree this is a bit messy - it's a classic "if the only tool you have is a hammer, everything looks like a nail". I did an ADR because app-services already has ADRs, but doesn't really have RFCs, and I didn't really want to invent a separate RFC process just for this. The ADR template is also a little too prescriptive IMO and doesn't really suit all such discussions, so my ADR kinda ignored the template a little. It was sent to the RFC list because that was suggested by karloff as the appropriate place to reach the other stakeholders for remote-settings in particular.

So yeah, my ADR was probably closer to an RFC :) I don't really think the distinction is particularly useful - in practice, someone writing an RFC intends to make a decision, just like an ADR.

**Comment (Contributor):** Thanks for all the feedback! What I'm understanding from what has been mentioned is that we have a way to solicit feedback on an idea (the RFC process), and that the ADR PR can link to that discussion or to some other document. For this particular PR, discussion on the decision is happening in JIRA, so this ADR is not the place to discuss the decision. I think this process is fine, and I don't see a strong reason to change it. ADR reviews tend to solicit another round of evaluating the decision, so knowing the process for decision-making is important for not delaying actually having a decision made!


## Pros and Cons of the Options

### A. Do nothing.

This option would keep the code as it's written today:

1. When the `metrics.dev_logger` config option is set, `configure_metrics()` sets the pseudo-private
`aiodogstatsd.Client._protocol` property to an instance of `metrics._LocalDatagramLogger`.
`_LocalDatagramLogger` subclasses `aiodogstatsd.client.DatagramProtocol` to log metrics
instead of sending them to DogStatsD.
([DISCO-2089](https://mozilla-hub.atlassian.net/browse/DISCO-2089))
2. merino-py's integration tests patch the pseudo-private `aiodogstatsd.Client._report()` method
with a mock implementation.
([DISCO-2090](https://mozilla-hub.atlassian.net/browse/DISCO-2090))
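
For illustration, here is a minimal sketch of what both pieces might look like. Only the names called out above (`configure_metrics()`, `_protocol`, `_LocalDatagramLogger`, `aiodogstatsd.client.DatagramProtocol`, `_report()`) come from this ADR; the `send()` override, the constructor details, and the test shape are assumptions.

```python
import logging
from unittest import mock

import aiodogstatsd
from aiodogstatsd.client import DatagramProtocol

logger = logging.getLogger("merino.metrics")


class _LocalDatagramLogger(DatagramProtocol):
    """Logs outgoing datagrams instead of sending them over UDP."""

    def send(self, data: bytes) -> None:  # assumed override point
        logger.debug("sending metric: %s", data.decode("utf-8"))


def configure_metrics(client: aiodogstatsd.Client, dev_logger: bool) -> None:
    if dev_logger:
        # Reaches into a pseudo-private attribute of aiodogstatsd.Client;
        # constructor arguments for _LocalDatagramLogger (if any) are omitted.
        client._protocol = _LocalDatagramLogger()


# In the integration tests, reporting is suppressed by patching another
# pseudo-private method:
def test_records_metrics() -> None:
    with mock.patch.object(aiodogstatsd.Client, "_report") as report:
        ...  # exercise merino-py code that emits metrics
        # assert on report.call_args_list
```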

#### Pros

* Currently working code!
* Doesn't require any engineering work now.
* `aiodogstatsd` is in maintenance mode, and unlikely to break merino-py's usage.

#### Cons

* Relying on non-public APIs could break merino-py if `aiodogstatsd` changes its implementation in the future.
* Assumes that merino-py will only ever report metrics to DogStatsD.

### B. Send metrics to a live DogStatsD agent.

This option would remove the `metrics.dev_logger` config option.
Instead, engineers would need to run a DogStatsD agent locally, and configure merino-py to use that agent.
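
As a rough sketch of what local development would look like under this option (the host, port, and metric name are illustrative, not merino-py's actual configuration):

```python
import asyncio

import aiodogstatsd


async def main() -> None:
    # Point the client at a DogStatsD (or dogstatsd-local) agent running on
    # the developer's machine instead of using metrics.dev_logger.
    client = aiodogstatsd.Client(host="localhost", port=8125)
    await client.connect()
    client.increment("dev.smoke_test")
    await client.close()


asyncio.run(main())
```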

#### Pros

* A single path for metrics reporting in local development, testing, and production.
* Less code in merino-py for metrics logging.
* Unofficial local DogStatsD agents exist,
written in [Go](https://github.com/jonmorehouse/dogstatsd-local) and
[Ruby](https://github.com/drish/dogstatsd-local).
* Local DogStatsD agents can log metrics in different formats.

#### Cons

* Engineers would need to configure and run a separate service when working locally.
* Requires an unofficial local DogStatsD agent.
The [official DogStatsD agent](https://hub.docker.com/r/datadog/dogstatsd) is only offered as a Docker container
that reports metrics to Datadog, and doesn't support custom or local-only backends.
The plain StatsD agent, which
[does support custom backends](https://github.com/statsd/statsd/blob/master/docs/backend.md),
doesn't support the DogStatsD extensions (histograms) that merino-py uses.
* Only suitable for local development.
Setting up a live DogStatsD agent would be more cumbersome to do in the integration tests.
* Assumes that merino-py will only ever report metrics to DogStatsD.

### C. Change merino-py's `metrics.Client` proxy to not call into `aiodogstatsd` in testing.

merino-py's `metrics.Client` class currently proxies calls to `aiodogstatsd.Client`.
The proxy keeps track of all metrics calls for testing, and adds custom tags for feature flags.
We can extend the proxy to only log metrics to the console if `metrics.dev_logger` is set,
without forwarding them to the underlying `aiodogstatsd.Client`.
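
A minimal sketch of that dev-logger branch, assuming the proxy looks roughly like this (the real `metrics.Client` also records calls for tests and adds feature-flag tags, which are omitted here):

```python
import logging
from typing import Any

import aiodogstatsd

logger = logging.getLogger("merino.metrics")


class Client:
    """Proxy in front of aiodogstatsd.Client (simplified sketch)."""

    def __init__(self, statsd: aiodogstatsd.Client, dev_logger: bool = False) -> None:
        self._statsd = statsd
        self._dev_logger = dev_logger

    def increment(self, name: str, **kwargs: Any) -> None:
        if self._dev_logger:
            # Log to the console instead of forwarding to aiodogstatsd.
            logger.debug("increment %s %s", name, kwargs)
            return
        self._statsd.increment(name, **kwargs)

    # gauge(), histogram(), timing(), etc. would follow the same pattern,
    # but timeit() and timeit_task() cannot be wrapped this way (see Cons).
```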

#### Pros

* Needs minimal changes to support metrics logging for gauges, counters, histograms, and one-off timings.
* Proxying calls avoids relying on implementation details of `aiodogstatsd`.

#### Cons

* Doesn't work for `aiodogstatsd.Client.{timeit, timeit_task}()`,
which return thunks that call `aiodogstatsd.Client.timing()`.
Since the proxy can't intercept these calls, it would need to reimplement these methods.
* Assumes that merino-py will only ever report metrics to DogStatsD.

### D. Subclass `aiodogstatsd.Client` to suppress sending metrics to StatsD in tests.

This option would change `metrics.Client` to be a subclass of `aiodogstatsd.Client`, instead of proxying calls to it.
The subclass would override the `{gauge, increment, decrement, histogram, distribution, timing}()` methods
to log metrics if `metrics.dev_logger` is set.
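
A minimal sketch of the subclassing approach; only `increment()` is shown, and the constructor handling is an assumption. The other metric methods would be overridden the same way, and `timeit()`/`timeit_task()` keep working because they call the overridden `timing()` internally.

```python
import logging
from typing import Any

import aiodogstatsd

logger = logging.getLogger("merino.metrics")


class Client(aiodogstatsd.Client):
    """aiodogstatsd.Client subclass that can log instead of sending (sketch)."""

    def __init__(self, *args: Any, dev_logger: bool = False, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)
        self._dev_logger = dev_logger

    def increment(self, name: str, **kwargs: Any) -> None:
        if self._dev_logger:
            logger.debug("increment %s %s", name, kwargs)
            return
        super().increment(name, **kwargs)
```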

#### Pros

* Requires roughly the same changes as Option C for the methods that don't return anything.
* Works for `aiodogstatsd.Client.{timeit, timeit_task}()` out of the box, without needing to reimplement them.

#### Cons

* Subclassing `aiodogstatsd.Client` relies on implementation details
(for example, that `timeit_task()` calls `timing()` when it's done) that could change
in a future version of `aiodogstatsd`.
* Assumes that merino-py will only ever report metrics to DogStatsD.

### E. Extend merino-py's `metrics` module with pluggable backends for production and testing.

This option would turn `metrics.Client` into a generic metrics interface that can send metrics to any supported
backend: DogStatsD, the console, or another metrics provider (Graphite, InfluxDB) in the future.
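
One possible shape for such an interface (every name here is illustrative; nothing like this exists in merino-py today):

```python
from typing import Any, Protocol


class MetricsBackend(Protocol):
    """What every backend would need to implement (illustrative)."""

    def emit(self, kind: str, name: str, value: float, tags: dict[str, Any]) -> None: ...


class LogBackend:
    """Backend for local development and tests: remembers emitted metrics."""

    def __init__(self) -> None:
        self.emitted: list[tuple[str, str, float, dict[str, Any]]] = []

    def emit(self, kind: str, name: str, value: float, tags: dict[str, Any]) -> None:
        self.emitted.append((kind, name, value, tags))


class Client:
    """Generic metrics interface that forwards to whichever backend it's given."""

    def __init__(self, backend: MetricsBackend) -> None:
        self._backend = backend

    def increment(self, name: str, value: int = 1, **tags: Any) -> None:
        self._backend.emit("count", name, value, tags)

    def timing(self, name: str, value: float, **tags: Any) -> None:
        self._backend.emit("timing", name, value, tags)
```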

#### Pros

* Doesn't rely on implementation details of `aiodogstatsd`.
* Makes it easy to use an integration test backend to verify recorded metrics.
* Can support any number of backends, without requiring substantial changes to merino-py or the metrics interface.

#### Cons

* Incurs the most engineering work out of all our options.
* Introduces another layer of abstraction that might not be useful beyond tests or local development.
* The new interface would need to reimplement convenience methods like `timeit_task()` from `aiodogstatsd`.

### F. Set up a mock DogStatsD endpoint that collects and checks metrics in tests.

This option would have the integration tests set up a UDP socket that collects, parses, and checks
the recorded metrics, following
[the same approach](https://github.com/Gr1N/aiodogstatsd/blob/4c363d795df04d1cc4c137307b7f91592224ed32/tests/conftest.py#L11-L13)
as the `aiodogstatsd` tests.
merino-py's `metrics.Client` would be configured to send metrics to this endpoint in tests.
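
A simplified sketch of such a fixture, assuming pytest and a plain blocking UDP socket; the aiodogstatsd tests linked above use an asyncio-based server instead, and the datagram shown in the comment is only an example of the wire format:

```python
import socket

import pytest


@pytest.fixture()
def statsd_server():
    """UDP socket that collects raw DogStatsD datagrams sent during a test."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))  # let the OS pick a free port
    sock.settimeout(1.0)
    received: list[str] = []

    def read_metric() -> str:
        # A datagram looks something like b"merino.suggestion:1|c|#provider:adm"
        data, _addr = sock.recvfrom(4096)
        metric = data.decode("utf-8")
        received.append(metric)
        return metric

    try:
        # Tests point metrics.Client at sock.getsockname() and then
        # call read_metric() to assert on what was reported.
        yield sock.getsockname(), read_metric
    finally:
        sock.close()
```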

#### Pros

* Integration tests wouldn't need to patch `aiodogstatsd.Client._report()`.

#### Cons

* The mock socket would need to parse the raw DogStatsD payload, which could introduce additional bugs.
* Replaces integration test usage only.
This option doesn't address the `metrics.dev_logger` use case, and setting up a mock socket for local development
would be more cumbersome.
* Assumes that merino-py will only ever report metrics to DogStatsD.

### G. Ask the `aiodogstatsd` authors to expose a reporting hook as part of their public API.

This option would involve changing `aiodogstatsd` to expose a public API for intercepting metrics before they're sent.
This could involve exposing `_protocol` and `_report()`, or adding a new API surface.

#### Pros

* Doesn't require substantial engineering work from our team.
* Provides a supported way to log and check metrics without relying on implementation details.

#### Cons

* We'd need to wait for the authors to consider, approve, implement, and publish a new version with our proposal.
This could take an unknown amount of time.
* `aiodogstatsd` is stable (the last commit was December 2021), and the authors could reject
making such a substantial change to the interface.
* Commits the `aiodogstatsd` authors to maintaining a new API, of which we would be the only consumer.
* Assumes that merino-py will only ever report metrics to DogStatsD.

`docs/adr/template.md` (new file, 94 additions)

# [short title of solved problem and solution]

* Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)] <!-- optional -->
* Deciders: [list everyone involved in the decision] <!-- optional -->
* Date: [YYYY-MM-DD when the decision was last updated] <!-- optional -->

Technical Story: [description | ticket/issue URL] <!-- optional -->

## Context and Problem Statement

[Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]

## Decision Drivers <!-- optional -->

1. [primary driver, e.g., a force, facing concern, …]
2. [secondary driver, e.g., a force, facing concern, …]
3. … <!-- numbers of drivers can vary, ranked in order of importance -->

## Considered Options

* A. [option A]
* B. [option B]
* C. [option C]
* D. … <!-- numbers of options can vary -->

## Decision Outcome

Chosen option:

* A. "[option A]"

[justification, e.g., the only option that meets the primary decision driver | resolves a force or facing concern | … | comes out best (see below)].

### Positive Consequences <!-- optional -->

* [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
* …

### Negative Consequences <!-- optional -->

* [e.g., compromising quality attribute, follow-up decisions required, …]
* …

## Pros and Cons of the Options <!-- optional -->

### [option A]

[example | description | pointer to more information | …] <!-- optional -->

#### Pros

* [argument for]
* [argument for]
* … <!-- numbers of pros can vary -->

#### Cons

* [argument against]
* … <!-- numbers of cons can vary -->

### [option B]

[example | description | pointer to more information | …] <!-- optional -->

#### Pros

* [argument for]
* [argument for]
* … <!-- numbers of pros can vary -->

#### Cons

* [argument against]
* … <!-- numbers of cons can vary -->

### [option C]

[example | description | pointer to more information | …] <!-- optional -->

#### Pros

* [argument for]
* [argument for]
* … <!-- numbers of pros can vary -->

#### Cons

* [argument against]
* … <!-- numbers of cons can vary -->

## Links <!-- optional -->

* [Link type] [Link to ADR] <!-- example: Refined by [ADR-0005](0005-example.md) -->
* … <!-- numbers of links can vary -->