Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce 'nodejs' field set #480

Merged
merged 2 commits into from
Jun 10, 2019
Merged

Conversation

axw
Copy link
Member

@axw axw commented Jun 5, 2019

Node.js-specific metrics pulled from elastic/apm-agent-nodejs#1021

Relates to #474

I have some questions of my own.

  1. ECS has one existing field in one field set for a duration: event.duration, which has no unit suffix. There is an open proposal to add uptime fields which also lack suffixes. Does it make sense to include "ns" as the suffix in nodejs.eventloop.delay.avg.ns? It's a bit of an eyesore, and is inconsistent with event.duration... but how else does a user know what the unit is?

  2. The ECS guidelines state that one should "avoid abbreviations when possible". On the other hand, in the Beats dev guide standardised names specifies using "avg" instead of "average". Which way do we go?

  3. nodejs.memory.heap.allocated.bytes and nodejs.memory.heap.used.bytes work well because they have "bytes" at the end, but for some other non-standard unit, what should we do? For example, suppose we were talking about allocated/freed objects:

    • nodejs.memory.heap.objects_used? Then we're not ending the phrase with a unit.
    • nodejs.memory.used_objects? Then we don't have a common prefix for auto-completion.

@axw axw requested review from webmat, watson, Qard and ruflin June 5, 2019 08:21
@ruflin
Copy link
Contributor

ruflin commented Jun 5, 2019

All good questions where we don't have a definitive answer yet but we should come to a conclusion. Let me open Pandoras box.

I think units do not belong in the field name but should be part of the field type. We added them to the field name in Metricbeat as it was the simplest solution back then. I would rather have support for elastic/elasticsearch#31244 This would make field names shorter, what unit was ingested or queried would not matter as Elasticsearch would standardise it.

The problem is we don't have elastic/elasticsearch#31244 yet but I think we should push for it. Assuming we add units now, we will run into a conflict later. Let's assume we have foo.bytes today and want to migrate to foo which is a byte field. The two fields cannot coexist as one is an object an other a simple "field". This will make alias impossible. So I would rather have now foo and we describe that it is in bytes and have a breaking change later and change the field type. I would expect queries which only use bytes to still work across both fields but allow to query based on GB on the newer indices.

How does the user know today which unit it is in? Obviously docs plays a part here but I think we can go further and either get support for it in the index pattern in Kibana or do something similar like the SIEM app which takes the defined fields and then exposes it on their own: https://github.com/elastic/kibana/tree/master/x-pack/plugins/siem/server/utils/beat_schema/8.0.0

Now let's get into abbreviations. I unfortunately agree with both ways. In general we should avoid abbreivations but there are exceptions to it. For me the exception are where the abbreviation is so well know that it can't be misinterpreted (or at least I hope it can't). Assuming we get bytes and duration type in Elasticsearch, half of the abbreviations in the current list will not be needed anymore. The other part is around metric types, like avg, count, min, max, pct, norm. (I'm ignoring request, message, and connection which I would argue today should not be in the list. These fields are basically a summary of multiple metrics and we build an object out of it to describe the boundaries.

Again I think it would be great to have a type for this in Elasticsearch that would enforce on how max, min, avg etc. should be stored. I remember there have been many discussions around this in the past but I think we never got to a full conclusion. I think apm-server (@graphaelli ) already today uses some standard on how to store such metrics? Something like:

{
    "foo": {
        "avg": 12,
        "min": 10,
        "max": 20
    }
}

Until we have a type in Elasticsearch, my suggestion would be to store it as an object and use abbreviations here. This also ties partially in the unit part above. If we include the unit in the name, this would become even longer. If we don't, things stay simpler. Assuming we will get a metric type in Elasticsearch, it will become interesting if a metric type can also have a unit?

Not sure I fully follow your last example. What would be the unit be for objects_used?

@axw
Copy link
Member Author

axw commented Jun 5, 2019

@ruflin I tend to agree with everything you've said.

Not sure I fully follow your last example. What would be the unit be for objects_used?

The unit for "nodejs.memory.heap.objects_used" is "object". I think in cases like this where there's no associated magnitude (as there is with "nanosecond", "kilobyte", etc.), there's no need for an explicit unit. Units would be used by the UI to render the values differently, accept alternative query formats, etc. For something like "objects_used", it's just a number.

If we do go down the route of removing units from names, then I think it would be fine for the metric name to be nodejs.memory.heap.objects.used.

So if we did all that, here's how the Node.js metric names would look:

  • nodejs.active_handles (maybe nodejs.handles.active?)
  • nodejs.active_requests (maybe nodejs.requests.active?)
  • nodejs.eventloop.delay.avg
  • nodejs.memory.heap.allocated
  • nodejs.memory.heap.used

These aligns well with the JVM metrics that the APM Java agent is already publishing:

  • jvm.memory.heap.used
  • jvm.memory.heap.committed
  • jvm.memory.heap.max
  • jvm.memory.non_heap.used
  • jvm.memory.non_heap.committed
  • jvm.memory.non_heap.max
  • jvm.thread.count
  • jvm.gc.count
  • jvm.gc.time
  • jvm.gc.alloc

The Go agent's metrics aren't aligned, but they can be changed later.

@webmat
Copy link
Contributor

webmat commented Jun 5, 2019

I agree with continuing to keep units out of field names as much as possible, documenting them for now and eventually having better support for units in Elasticsearch.

Count metrics such as "objects" and "handles" should absolutely remain without a unit, too.

I'm also good with standardizing on some more shortened names for metrics such as avg, pct, min, max. I think they're all pretty standard, and won't confuse anyone. Plus I don't think anyone will really insist on having nodejs.eventloop.delay.average.milliseconds, right? 😂

On the other hand, I'm still absolutely not convinced that we should add a multitude of language/runtime names in ECS, however. Even if different runtimes have different sets of important metrics (e.g. eventloop in Node, more focus on heap/non-heap in Java). I have limited time today to make my case, perhaps we should find a time to discuss live :-)

But I do think we could define one general purpose field set, that encapsulates all the metrics commonly seen. Then each language populates what's relevant to them. And of course we can add fields to identify the runtime, rather than having this identification be done (and limited to) a field nesting.

Benefits of having one nesting for runtime metrics:

  • We can store more metadata about the runtime. Not only runtime.name: jvm, but also its runtime.version, or even other tidbits like language (e.g. jvm & runtime.language.name:scala, runtime.language.version).
  • People can decide to track metrics that wouldn't be associated with a given runtime in the general case, but is enabled by a library. For example the event loop metrics could be populated when a Ruby or Python project uses the relevant libraries to do event-based code.
  • Makes it trivial to support languages whose runtime have significant underlying changes from one version to another. You just start populating what makes sense, in the new version.
    • Example: Ruby moving from green threads to OS threads between 1.8 and 1.9. Since it's a common field set, we simply populate the metrics that make sense for 1.8, and for 1.9. Granted, Ruby 1.8 to 1.9 was a long time ago, but shifts like this are bound to happen again for other runtimes.
  • It's possible to build common content such as dashboards and ML jobs that can be used for all similar languages at the same time.
    • Note that obviously languages with wildly different usage patterns (Node & Java) wouldn't benefit from that. But languages with really similar usage patterns such as Python and Ruby would likely benefit.
  • Once we have a solid field set defined that covers most languages, the field set should remain stable. If we go for one field set per runtime, the maintenance will be much higher: we'll constantly be asked to add missing languages (popular or not); needing to fix multiple field sets, when we start collecting more metrics for a given kind of runtime.

I'm sure more ideas will come to mind later, but let's start with these.

I'd like to know what are the benefits of having one field set per runtime, can you elaborate on that?

@webmat
Copy link
Contributor

webmat commented Jun 5, 2019

In addition, seeing a paste of one set of metrics / one event for each runtime would help make the discussion more concrete.

@axw Could you provide a gist of that?

@ruflin
Copy link
Contributor

ruflin commented Jun 5, 2019

The biggest benefit of having a field per runtime I think is progress. We need a place for them now and today. I could see us adding these fields now in some way to ECS (we should discuss on how this works) and figure out in the next x months how we can generalise it and potentially have it unified for ECS 2.0.

I rather have everyone using now a standard that we change later and have a migration path then not having a standard. We should discuss in more detail "how" we put it in ECS, for example as experimental, but I see this out of scope of this PR. Keep in mind, this PR is against a branch and not master (on purpose), so we can delay this discussion.

In any case we should still introduce some runtime fields for storing the runtime name, language etc. Follow up issue?

@axw I like your proposal. +1 on nodejs.handles.active, nodejs.requests.active.

@webmat
Copy link
Contributor

webmat commented Jun 5, 2019

What I'm hoping to get from having a few metric samples provided as part of this discussion, is insight into how much work it would actually be, if we tried to boil this down to one single field set.

In my limited understanding (until I see the requested data samples), I would expect a few fields to be used across the board, then some fields used only in some cases (some for event loop runtimes, some for threading heavy runtimes, etc). And I think that's totally fine. If most of these subsets are orthogonal, it doesn't really make it complicated to merge all this one one field set.

I can totally accept that implementing a common field set in Mb and APM is out of scope for the work that needs to happen now. But if we want to get this standardized for 8.0, we need to start having the standardization discussion now, not weeks before 8.0 ;-) Do you guys think we can do both? Start the standardization discussion, while you guys align Metricbeat & APM in the short term?

Note that having the standardization discussion now may even surface problems that wouldn't be apparent, when working in one field set per runtime.

@ruflin
Copy link
Contributor

ruflin commented Jun 6, 2019

I think to see if we can standardise it we first need to have first all the fields we want to have standardised in one place. That is also what this brach is for. So my suggestion for moving forward:

  1. Get all the metrics for each runtime into this branch
  2. Get the runtime.* obvious fields into this branch (.name, ...)
  3. Define our standards around metrics (discussion about) and document them
  4. Discuss if we can merge the common metrics into runtime
  5. Open a PR against master

@axw Any chance you could open a second PR against this branch that summarises our discussion around naming and formatting of metrics?

@webmat
Copy link
Contributor

webmat commented Jun 6, 2019

Yeah I don't mind merging whatever's helpful for experimentation in the ecs-metrics branch 👍

But we'll need to address #480 (comment) before merging to master.

@ruflin
Copy link
Contributor

ruflin commented Jun 6, 2019

@axw Could you update the PR that it aligns with our discussion around naming so we can get it in?

@Qard
Copy link

Qard commented Jun 6, 2019

I think of the names as containing an implicit unit type and value intent of integer and count, unless otherwise specified. So nodejs.handles.active is actually basically the same as nodejs.handles.active.count.integer. If something is not a basic integer count, I feel like it helps to have something in the name to communicate to the user what it is. For our own uses, storing that information internally is fine, but makes it less obvious exactly what it is.

- drop units from names
- make "active" a sub-field of "requests" and "handles"
- set eventloop.delay.avg's format to duration
@axw axw force-pushed the ecs-metrics-nodejs branch from 4c3a4ad to ffc6249 Compare June 7, 2019 04:10
@axw
Copy link
Member Author

axw commented Jun 7, 2019

Updated, PTAL.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, from my perspective we can merge this as is an continue :-)

@Qard Do you see some fields we have in at the moment which would not match your description?

@Qard
Copy link

Qard commented Jun 7, 2019

Well, for example using nodejs.eventloop.delay.avg instead of nodejs.eventloop.delay.avg.ms. Dropping the unit type makes it unclear what the scale is, especially while we don't yet have more granular types in Elasticsearch.

@axw
Copy link
Member Author

axw commented Jun 10, 2019

The problem is that .ms conflicts with other goals.

I'll merge this as is. APM agents might not report metrics initially consistent with this list, as we're going to be making new releases with the runtime metrics pretty soon. We'll keep working on this for eventual consistency.

@axw axw merged commit 56646eb into elastic:ecs-metrics Jun 10, 2019
@axw axw deleted the ecs-metrics-nodejs branch June 10, 2019 04:09
| nodejs.eventloop.delay.avg
| The average event loop delay for the reporting period.

Event loop delay is sampled every 10 milliseconds. Delays shorter than 10 milliseconds may not be observed, for example if a blocking operation starts and ends within the same sampling period.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: The "10 milliseconds" is a configurable number. In the Elastic APM Node.js agent we have currently hard-coded it to 10 ms (which is also the default in Node.js), but the Node.js API we use the extract this information can be configured to any other duration.

Should we make a PR to update this? And if so, what should we write instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I was perhaps being a bit too specific about how the APM agent implementation works. Yes we should probably update this.

How about this?

Event loop delay is periodically sampled, e.g. every 10 milliseconds, but this may vary by source. Delays shorter than the sampling period may not be observed, for example if a blocking operation starts and ends within the same sampling period.

@roncohen
Copy link
Contributor

(this discussion doesn't fit under this Node.js issue, but since it happened here, I'll include this here)

I just had a discussion with @exekias, @ruflin and @axw. Beats already have a convention to add "bytes" and time unit suffixes (wasn't documented, that's being fixed). There's a big wish to move away from that, to encoding the information in field metadata or similar once ES supports that. Even if we get field level meta data or a field type, generic inputs like Prometheus, might not be able to support that.

Bottom line is, there's a convention and many many fields following it, and since we don't have a different solution to the problem of encoding units, we should rather stick to the current convention until we decide on an alternative convention and have a migration plan to that.

So, we will postfix time based metrics with their unit and "bytes" on the bytes based metrics for now. We will not include these into ECS for now, but just align across agents outside ECS. The java metrics are already out. We will need to eventually deprecate those and move to the right suffices.

@watson
Copy link

watson commented Jun 17, 2019

FYI: The Node.js agent currently also add a pct suffix if the field is percent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants