Clarify use of hostname, subdomain, domain in source/destination #84

Closed
andrewkroh opened this issue Aug 15, 2018 · 33 comments

@andrewkroh
Member

andrewkroh commented Aug 15, 2018

It's not clear to me how to populate the hostname, subdomain, and domain fields of source / destination. More detailed descriptions of each field are needed with examples.

It would probably be helpful to establish some terminology that could be used in clarifying the descriptions.

Terms:

  • FQDN: fully qualified domain name
  • TLD: top level domain (e.g. .com, .net, .bmw, .us) [list of TLDs]
  • eTLD: effective top level domain (e.g. .com, .co.uk and pvt.k12.wy.us) [these get determined with the help of the public suffix list]
  • eTLD+1: effective top level domain plus one level (e.g. example.com, example.co.uk)
  • SLD: second level domain (e.g. co is the SLD of www.example.co.uk)

Examples showing the mappings of these FQDNs to ECS would probably be sufficient to clarify the topic for me.

  • example.com
  • www.example.com
  • www.example.co.uk

Logstash has a TLD filter that uses similar field names, possibly(?) with different meanings.

@andrewkroh andrewkroh added the question Further information is requested label Aug 15, 2018
@webmat webmat mentioned this issue Sep 18, 2018
@MikePaquette
Contributor

Thanks @andrewkroh Sorry to take so long to get back to this. This seems easy until you try to spell it out :-). I’d propose the following:

ECS *.hostname should contain the FQDN
ECS *.domain should contain the HRD (highest registered domain) (similar to eTLD+1)
ECS *.subdomain should contain FQDN minus HRD (everything to the left of the HRD)

CASE 1
Let’s look at: example.com where there is no mention of “example.com” in the public suffix list.

host.hostname: "example.com"
device.hostname: "example.com"
source.domain: "example.com"
destination.domain: "example.com"
source.subdomain: ""
destination.subdomain: ""

CASE 2
Let’s look at: myhost.example.com where there is no mention of “example.com” in the public suffix list.

host.hostname: "myhost.example.com"
device.hostname: "mydevice.example.com"
source.domain: "example.com"
destination.domain: "example.com"
source.subdomain: "myhost"
destination.subdomain: "myhost"

CASE 3
Let’s look at: myhost.example.co.uk where “co.uk” is listed in the public suffix list.

host.hostname: "myhost.example.co.uk"
device.hostname: "mydevice.example.co.uk"
source.domain: "example.co.uk"
destination.domain: "example.co.uk"
source.subdomain: "myhost"
destination.subdomain: "myhost"

CASE 4
Let’s look at: myhost.compute.example.co.uk where “co.uk” is listed in the public suffix list.

host.hostname: "myhost.compute.example.co.uk"
device.hostname: "mydevice.compute.example.co.uk"
source.domain: "example.co.uk"
destination.domain: "example.co.uk"
source.subdomain: "myhost.compute"
destination.subdomain: "myhost.compute"
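
For illustration, the breakdown proposed above (hostname = FQDN, domain = HRD, subdomain = FQDN minus HRD) could be sketched in Python roughly as follows. This is a minimal sketch, not ECS code: `split_fqdn` and `SUFFIXES` are hypothetical names, and the three-entry suffix set is only a stand-in for the real public suffix list.

```python
# Hypothetical stand-in for the public suffix list. A real implementation
# would load the full, current list instead of these three entries.
SUFFIXES = {"com", "uk", "co.uk"}


def split_fqdn(fqdn, suffixes=SUFFIXES):
    """Split an FQDN into the hostname / domain / subdomain values proposed above."""
    labels = fqdn.lower().rstrip(".").split(".")

    # Find the longest known public suffix at the end of the name.
    suffix_len = 0
    for n in range(1, len(labels) + 1):
        if ".".join(labels[-n:]) in suffixes:
            suffix_len = n

    # No known suffix, or the name is itself a suffix: nothing to split.
    if suffix_len == 0 or suffix_len == len(labels):
        raise ValueError(f"no registered domain found in {fqdn!r}")

    # The HRD is the public suffix plus one more label to its left.
    domain = ".".join(labels[-(suffix_len + 1):])
    subdomain = ".".join(labels[:-(suffix_len + 1)])
    return {"hostname": ".".join(labels), "domain": domain, "subdomain": subdomain}
```

Under these assumptions, `split_fqdn("myhost.compute.example.co.uk")` yields the CASE 4 values: domain `example.co.uk` and subdomain `myhost.compute`.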

@webmat
Contributor

webmat commented Oct 1, 2018

@MikePaquette So the idea is to break down each of these hostnames, wherever they are defined, correct?

I'm asking this because your examples (e.g. Case 2) don't do that explicitly.

Here's how it should be populated, as I understand it:

Case 2
If we have details about a host:
host.hostname: "myhost.example.com"
host.domain: "example.com"
host.subdomain: "myhost"

If we have details about a device
device.hostname: "mydevice.example.com"
device.domain: "example.com"
device.subdomain: "mydevice"

If we have src/dst details about a connection (I changed this part slightly vs your case 2, to illustrate a host talking to an API, for example):
source.hostname: "myhost.example.com"
source.domain: "example.com"
source.subdomain: "myhost"
destination.hostname: "api.example.com"
destination.domain: "example.com"
destination.subdomain: "api"

@MikePaquette
Contributor

@webmat yes, that is correct, the same breakdown would apply to each namespace/object/prefix where *.hostname, *.domain, and *.subdomain are defined. Sorry I forgot to include the source.hostname and destination.hostname fields in my examples.

And yes, your example of a host talking to an API is consistent with this definition.

I'll update the entire set of cases with the missing fields for completeness.

@MikePaquette
Contributor

Here's an updated set of reference Cases and clarifications, based on @webmat's feedback. Added his example as CASE 5:

ECS *.hostname should contain the FQDN
ECS *.domain should contain the HRD (highest registered domain) (similar to eTLD+1)
ECS *.subdomain should contain FQDN minus HRD (everything to the left of the HRD)

CASE 1
Let’s look at: example.com where there is no mention of “example.com” in the public suffix list. Here's how you'd populate any of the ECS fields that might be relevant to your event:

host.hostname: "example.com"
device.hostname: "example.com"
source.hostname: "example.com"
destination.hostname: "example.com"
source.domain: "example.com"
destination.domain: "example.com"
source.subdomain: ""
destination.subdomain: ""

CASE 2
Let’s look at: myhost.example.com where there is no mention of “example.com” in the public suffix list. Here's how you'd populate any of the ECS fields that might be relevant to your event:

host.hostname: "myhost.example.com"
device.hostname: "mydevice.example.com"
source.hostname: "myhost.example.com"
destination.hostname: "myhost.example.com"
source.domain: "example.com"
destination.domain: "example.com"
source.subdomain: "myhost"
destination.subdomain: "myhost"

CASE 3
Let’s look at: myhost.example.co.uk where “co.uk” is listed in the public suffix list. Here's how you'd populate any of the ECS fields that might be relevant to your event:

host.hostname: "myhost.example.co.uk"
device.hostname: "mydevice.example.co.uk"
source.hostname: "myhost.example.co.uk"
destination.hostname: "myhost.example.co.uk"
source.domain: "example.co.uk"
destination.domain: "example.co.uk"
source.subdomain: "myhost"
destination.subdomain: "myhost"

CASE 4
Let’s look at: myhost.compute.example.co.uk where “co.uk” is listed in the public suffix list. Here's how you'd populate any of the ECS fields that might be relevant to your event:

host.hostname: "myhost.compute.example.co.uk"
device.hostname: "mydevice.compute.example.co.uk"
source.hostname: "myhost.compute.example.co.uk"
destination.hostname: "myhost.compute.example.co.uk"
source.domain: "example.co.uk"
destination.domain: "example.co.uk"
source.subdomain: "myhost.compute"
destination.subdomain: "myhost.compute"

CASE 5
Let’s look at the case when we have a transaction (e.g., network flow) where the source is myhost.example.com talking to destination api.example.com, and we have no information about a device. Here are the ECS fields and values you'd attempt to populate.

source.hostname: "myhost.example.com"
source.domain: "example.com"
source.subdomain: "myhost"
destination.hostname: "api.example.com"
destination.domain: "example.com"
destination.subdomain: "api"

@ruflin
Contributor

ruflin commented Oct 1, 2018

Engineer perspective question: Assuming someone has an FQDN, is it possible to do the split in a fully automated way?

@webmat
Contributor

webmat commented Oct 1, 2018

It's possible, but will be a mess.

Consider third and fourth level domains (see the .ca mess). I assume there's some sort of database that lists out all TLDs. I wonder how it deals with things like the .ca situation. It's been impossible to register a .qc.ca since 2010, but these domains still resolve. There's also the government one -- .gc.ca -- that's actually a domain, not a TLD, but is used by all branches of govt (so behaves like a third level domain).

Now I'm curious how Packetbeat computes dns.question.etld_plus_one.

@andrewkroh
Member Author

Thank you for clarifying the field definitions.

Implementing the logic to generate domain from hostname is non-trivial due to the mercurial list of TLDs. To do this accurately you need to continuously update your public suffix list. We need to add an Ingest Node processor for computing this value so that this logic is centralized and only needs to be maintained in one place.
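
To give a sense of why this is non-trivial, here is a rough Python sketch of matching a name against rules written in public suffix list syntax, including wildcard (`*.`) and exception (`!`) rules. It is an illustration only, not the logic of any Ingest Node processor: the function names are hypothetical, and `SAMPLE_RULES` is a tiny excerpt-style sample that would have to be replaced by the full, regularly refreshed list.

```python
# A handful of rules in public suffix list syntax (illustrative sample only).
SAMPLE_RULES = """
// comments start with two slashes
com
uk
co.uk
*.ck
!www.ck
"""


def parse_rules(text):
    """Return the set of rules, skipping blank lines and // comments."""
    rules = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("//"):
            rules.add(line)
    return rules


def public_suffix_length(labels, rules):
    """Return how many trailing labels make up the public suffix."""
    best = 1  # implicit default rule "*": the last label alone is a suffix
    for i in range(len(labels)):
        candidate = labels[i:]
        name = ".".join(candidate)
        if "!" + name in rules:
            # An exception rule wins; the suffix is the rule minus its leftmost label.
            return len(candidate) - 1
        wildcard = ".".join(["*"] + candidate[1:])
        if name in rules or wildcard in rules:
            best = max(best, len(candidate))
    return best


def registered_domain(fqdn, rules):
    """Return the public suffix plus one label, or None if the name is itself a suffix."""
    labels = fqdn.lower().rstrip(".").split(".")
    n = public_suffix_length(labels, rules)
    if len(labels) <= n:
        return None
    return ".".join(labels[-(n + 1):])
```

With the sample rules, `registered_domain("www.example.co.uk", parse_rules(SAMPLE_RULES))` gives `example.co.uk`. The result is only as accurate as the rule set, which is the argument for maintaining it in one central place.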

The subdomain field doesn't seem to be necessary because the event already includes both the hostname and domain. Any application that needs this could trim the domain suffix from the hostname. I think subdomain should be removed from the schema.
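
As a rough sketch of that trimming step (Python, with a hypothetical helper name `derive_subdomain`), assuming the event already carries a hostname that is an FQDN and the domain computed from it:

```python
def derive_subdomain(hostname, domain):
    """Return the part of hostname to the left of domain, or "" if there is none."""
    if hostname == domain:
        return ""
    suffix = "." + domain
    if not hostname.endswith(suffix):
        raise ValueError(f"{hostname!r} is not under {domain!r}")
    # Drop the domain and the dot that separates it from the subdomain.
    return hostname[: -len(suffix)]
```

For example, `derive_subdomain("myhost.compute.example.co.uk", "example.co.uk")` returns `myhost.compute`.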

Lastly what should an implementor do when they do not have a device's FQDN? Often I see syslog messages that contain a hostname that is not fully qualified (so it doesn't meet the requirements of hostname as described here). Should we have separate fields for hostname and fqdn?

@MikePaquette
Contributor

Thanks @andrewkroh

Regarding the last question (if no FQDN available), would it be better to create a separate field, or to just populate the *.hostname field with the best info available?

@webmat
Contributor

webmat commented Oct 2, 2018

@andrewkroh Why do you say the subdomain is no longer necessary? I agree that once we've computed the domain and we have the full hostname, we can derive the subdomain. But I think there's value to having the subdomain indexed separately and queryable.

The hostname vs fqdn is a thing that bugs me. I've often run infrastructures where the app servers didn't actually have an FQDN on purpose. The instances didn't have a public IP and they didn't have a DNS entry. There was just a variable amount of app servers in an autoscaling group, only available via the load balancer (or ssh via a VPN hop).

So expecting hostname to be an FQDN doesn't quite fit reality, IMO. Perhaps we need one more field in this bunch?

  • hostname: may be a simple machine name or an FQDN
  • fqdn: the fully qualified domain name, if available
  • domain: the "registerable" domain, derived from fqdn
  • subdomain: the subdomain, derived from fqdn

So the "central" field would become the fqdn. For a host you're monitoring, you'd fill both hostname and fqdn. For a remote endpoint, you'd only populate fqdn. Then in both cases, from the fqdn, one could derive the domain and subdomain.

@webmat
Contributor

webmat commented Oct 4, 2018

Specifically in the case of network monitoring, having the full domain of a remote resource in hostname will be very confusing for people, I think.

  • I'd feel better if the full domain ended up in fqdn and the registerable domain in domain.
  • Or even more intuitive to me, the full domain in domain and the registerable domain in another field for which I don't have a good name. Perhaps registerable_domain (it's an expression I saw elsewhere only recently and I got what it meant when I read it). I'm not a huge fan of Packetbeat's etld_plus_one, although it's also descriptive, once you pause and think about it :-P

Do people (esp. my ECS colleagues, @MikePaquette & @ruflin) feel strongly that hostname should be used to describe a remote resource's full domain name? Or was this just an accident of starting from the POV of talking about a local host under management?

@ruflin
Contributor

ruflin commented Oct 5, 2018

hostname from my POV should contain just whatever the user gets. There are lots of use cases where fqdn, domain or subdomain is not needed. The user needs a place to just drop the info he has without understanding all of the above discussion, and that is hostname.

@andrewkroh
Member Author

Why do you say the subdomain is no longer necessary?

Except for the specialized use by the ML job to compute a score exclusively on the subdomain value, I cannot think of any uses for it that are not covered by using a combination of FQDN and domain. For example, if you wanted to compute the number of unique subdomains associated with each domain, you would bucket on domain and then count unique FQDNs. Are there use cases that aren't covered by the fqdn and domain?
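
The bucketing idea can be sketched in plain Python as below. The event shape and the `fqdn` / `domain` keys are assumptions made for illustration, not agreed ECS field names, and in practice this would be an aggregation in the datastore rather than application code.

```python
from collections import defaultdict


def unique_fqdns_per_domain(events):
    """Bucket events on domain, then count the distinct FQDNs seen in each bucket."""
    seen = defaultdict(set)
    for event in events:
        seen[event["domain"]].add(event["fqdn"])
    return {domain: len(fqdns) for domain, fqdns in seen.items()}


# Hypothetical sample events
events = [
    {"fqdn": "www.example.com", "domain": "example.com"},
    {"fqdn": "api.example.com", "domain": "example.com"},
    {"fqdn": "www.example.com", "domain": "example.com"},
    {"fqdn": "myhost.example.co.uk", "domain": "example.co.uk"},
]
```

Here `unique_fqdns_per_domain(events)` reports two distinct names under `example.com` and one under `example.co.uk`, without any subdomain field being stored.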

@webmat
Contributor

webmat commented Oct 5, 2018

@ruflin @MikePaquette :

Yes, my point is that hostname should be what the hostname command returns. In other words it has nothing to do with a remote resource's full domain. In other words if Packetbeat detects an outgoing call from webscale42 to api.example.com, the fields should be populated like this:

source.hostname: webscale42
destination.fqdn: api.example.com
destination.domain: example.com
destination.subdomain: api

If instead my webserver's hostname is webscale42.scalableexample.com, the fields would be populated like this:

source.hostname: webscale42
source.fqdn: webscale42.scalableexample.com
source.domain: scalableexample.com
source.subdomain: webscale42
destination.fqdn: api.example.com
destination.domain: example.com
destination.subdomain: api

This is what I understand from the initial discussion, at least.

Note that I don't actually see the usefulness of mixing hostname with the breakdown of the domain... Even if a host can be given an FQDN, I think breaking it down as if it was a remote domain we need to inspect is not particularly helpful. I think it will complicate things and will confuse people.

I'll reformulate a bit what I think would be the most straightforward way to approach this, I'd like to understand people's POV if this is missing anything. The below definitions are exactly the same for source. and destination. fields, so I won't duplicate them:

  • hostname: the name of a host under management that's taking part in this flow.
    • If both endpoints are under management, you may be able to populate hostname on both source and destination.
    • In cases where one side of the connection is a remote endpoint you don't manage, hostname is not expected to be populated for that side of the connection.
  • domain: the full domain, including subdomain.
    • Note that I'm going with the most straightforward name of domain for the most widely used value, before breaking it down. I think this will meet people's expectation best.
  • registerable_domain: the part of the domain without any subdomain. The TLD + the root of this domain.
  • subdomain: the full domain minus the registerable_domain and without a trailing period.

@andrewkroh Ok perhaps we can take it out. It's true it can be computed any time we need it. One of the security use cases is to look at the length of a subdomain, but perhaps there's no need to have it saved on every single event. I guess the question is: are there times where we need to aggregate specifically on subdomain (regardless of the actual domain, so all www for all domains)? I can't think of one.

@ruflin
Contributor

ruflin commented Oct 15, 2018

I suggest we remove subdomain for now and add it later when needed. The discussion in this thread shows the complexity around subdomain on what it could be. Removing it does not mean tools can't use it.

@webmat
Contributor

webmat commented Oct 16, 2018

Yeah ok, I don't mind removing subdomain for now. I agree this is getting needlessly complex.

This leaves us to determining the name for the fields we actually use. Here's how I would do it:

  • domain: the full domain, including subdomain
  • registered_domain: everything except the subdomain
  • hostname: 100% independent of domains. It's the name of the host from its own point of view (e.g. running hostname on POSIX systems). Whether or not it's an FQDN is irrelevant, we put the value as is in hostname and we don't break it down. FQDNs are less and less relevant in a world of disposable infrastructure anyway :-)

Please let me know what you think @MikePaquette and @ruflin so we can close the loop on this :-)

@ruflin
Contributor

ruflin commented Oct 17, 2018

I'm good with hostname; domain I leave to @andrewkroh to comment. As registered_domain is not in ECS yet, let's not go there.

@andrewkroh
Member Author

domain was originally proposed as being the "registered domain" and I think that is the concept that is most important to keep. So regardless of field naming we need a place to record the hostname and the "registered domain".

@ruflin
Contributor

ruflin commented Oct 18, 2018

To keep this moving:

  • domain = registered domain
  • hostname = anything

?

@webmat
Contributor

webmat commented Oct 22, 2018

To recap what was discussed elsewhere, I'll insist again on a distinction I would like us to make.

The link between a hostname and an FQDN is more and more becoming obsolete, in my opinion. Given the following:

  1. It's now extremely common to do load balancing between multiple hosts, to serve a given service, hosted at a given domain. Not doing so is actually becoming the exception, at least for production systems.
  2. A security best practice is to not have these app servers addressable on the public internet, since the proper way to reach them is actually via the load balancer. Therefore their hostname is actually rarely a proper FQDN.
  3. With the rise of containers, any given host may on top of that be hosting more than one application (potentially at more than one domain).

So I think nowadays people expect hostname to simply mean the internal name given to a host, and there may or may not be any link to the domain(s) it serves traffic for. If the hostname happens to be an FQDN, I'd say there is no actual interest in breaking it down or considering this having a relation to a domain that is part of the event stream. We store it as is in hostname and that's the end of it.

The use of hostname as I see it:

  • host.hostname or device.hostname
    • populated at ingestion time by the agent
    • not parsed or broken down, just stored as is
  • source.hostname and/or destination.hostname
    • If there's a desire to fill those, ideally would be done at enrichment time, since any host would at best know only its own side's hostname. A network device may not know either, even if both are under management. So having an enrichment process is the solution that would fill this most reliably.
    • It's expected and worthwhile to do this for security purposes, since it gives context on which side(s) of the communication is under management vs a public endpoint.
    • Even when under source. and destination., the hostname field takes no part in the breaking down of the domain name. It's just the internal host name given to that side of the connection.

The question around the breaking down of a full domain essentially revolves around which one of the values we consider the "default" or most interesting piece of information. That one should be named domain, and the "other" one should have the longer name. Here are the options:

Full domain first:

  • domain = full domain name, including subdomain
  • registerable_domain = registerable domain, without the subdomain

Registerable domain first:

  • domain = registerable domain, without the subdomain
  • full_domain = full domain name, including subdomain

We can decide to just not use domain. This way both sides suffer, and we never get to use a shorthand, but it's fair to both preferences :trollface: ;-)

  • full_domain = full domain name, including subdomain
  • registerable_domain = registerable domain, without the subdomain

I do think we need to define two fields for domains, not just domain. Defining only domain for registerable suggests to people they are losing the full information of the full domain. Of course they can add a second field of their liking to keep it around, but I think we should just define it now. Both the full domain and the registerable domain can be useful for aggregations. We just need to agree on which one gets to be named domain.

What I gather so far on who prefers what:

I'm unsure where @MikePaquette stands on this, as initially his comments were about subdomain, which we've taken out of the discussion as a field on its own.

@webmat
Contributor

webmat commented Oct 22, 2018

So given all this, if you guys feel strongly about having domain as registered domain, I think it's a bit unintuitive, but I can go with it. As soon as someone sees domain and full_domain they will get the gist of it.

@webmat
Contributor

webmat commented Oct 22, 2018

Note that url.host.name will also have to be replaced by domain & its counterpart (whichever one we decide to go with).

@webmat
Contributor

webmat commented Oct 22, 2018

Opened PR #141 to close the loop on this. I went with your preference of domain + full_domain.

I've actually come around to liking how much more compact full_domain is vs registerable_domain (even though it feels a bit awkward under url. 😄).

@MikePaquette
Contributor

@webmat after this discussion, I'll change my original proposal and vote for
Full domain first:

  • domain = full domain name, including subdomain
  • "something else" = highest registered domain, without the subdomain

here would be some anticipated common mappings:

  • bro dns.log query -> ecs destination.domain
  • bro http.log host -> ecs destination.domain
  • cef dhost -> ecs destination.hostname
  • cef shost -> ecs source.hostname
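
As a sketch only, these four mappings could be written as a lookup table in Python. The log-type labels (`bro_dns`, `bro_http`, `cef`) and the `to_ecs` helper are hypothetical names chosen for the example; only the source field names and ECS targets come from the list above.

```python
# (log type, source field) -> ECS field, following the mappings listed above.
FIELD_MAP = {
    ("bro_dns", "query"): "destination.domain",
    ("bro_http", "host"): "destination.domain",
    ("cef", "dhost"): "destination.hostname",
    ("cef", "shost"): "source.hostname",
}


def to_ecs(log_type, record):
    """Copy the mapped fields of a parsed record into a flat dict of ECS field names."""
    ecs = {}
    for field, value in record.items():
        target = FIELD_MAP.get((log_type, field))
        if target is not None:
            ecs[target] = value
    return ecs
```

For instance, `to_ecs("cef", {"shost": "webscale42", "dhost": "db01"})` fills `source.hostname` and `destination.hostname` and ignores any unmapped fields.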

I'm not sure why we need the additional hostname field for URL. Is there a good example where this would not map to ecs destination.domain or destination.hostname ?

@ruflin
Contributor

ruflin commented Oct 23, 2018

If someone has domain and full_domain, both could be just put into domain as it's an array. Then querying / aggregation on it would work for both. Again for me the above discussion shows it's not that easy. So if someone has a domain, whether it's the full domain or not, there should be just a simple place to put it.

For the url: let's not mix this in. The url is split up based on common patterns from different programming languages. If we need at one stage also domain, we can add it but not now.

@webmat
Contributor

webmat commented Oct 23, 2018

I will break #141 down into smaller PRs, as we just discussed. The host.name => host.hostname change is straightforward and shouldn't be held up by this discussion.

We've also discussed that saving both values for the domain (the FQDN and the registered domain) as an array in one field is not the way to go, because all domains with a subdomain would then be counted twice in aggregations. Once as www.example.com and once as example.com. So we do need one field for each (the registered domain field may be omitted by integrations that don't see the value).
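
The double counting is easy to reproduce with a small Python sketch. `count_values` is a hypothetical helper that mimics a terms-style aggregation (a list contributes once per element), and the `domain` / `registered_domain` keys are placeholders for whichever pair of names is chosen.

```python
from collections import Counter


def count_values(events, field):
    """Count events per value of field; a list value contributes once per element."""
    counts = Counter()
    for event in events:
        value = event.get(field)
        if value is None:
            continue
        counts.update(value if isinstance(value, list) else [value])
    return counts


# Two requests: one to www.example.com, one to example.com.
array_style = [
    {"domain": ["www.example.com", "example.com"]},
    {"domain": ["example.com"]},
]
two_fields = [
    {"domain": "www.example.com", "registered_domain": "example.com"},
    {"domain": "example.com", "registered_domain": "example.com"},
]
```

With `array_style`, the two events produce three bucket hits on `domain`, because the first request lands in both `www.example.com` and `example.com`. With `two_fields`, each field sums to exactly two.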

In working on PR #141, I realized that url.hostname is currently documented as the place to put the domain name (see in master) when breaking down a URL into its components. Given my comment from yesterday here I replaced it with the domain pair. I was not introducing anything new, I was just fixing the naming :-)

So with all of this said, it looks like we didn't actually have agreement on how to name the domain breakdown fields. Here are the options once more:

Full domain name including subdomain:

Highest Registerable Domain:

  • domain
  • registered_domain
  • hrd
  • etld_plus_one (like Packetbeat)

So we need to find a suitable pair of field names that makes sense to host the full domain name and an optional field, without a subdomain. It would be helpful if we could have a simple vote in the comments here on everyone's favourite pair of field names.

Note about fqdn: I was ok with the suggestion at first glance. But I'm not anymore, here's why: this implies that there actually is a fully qualified domain name in that field. In a DNS query, we don't actually know yet if there's anything at this domain name, we don't have the answer yet. So we can't put that in a field called fqdn. Therefore I think we need a field name that's more general than this.

@webmat
Contributor

webmat commented Oct 23, 2018

Here's my vote (I keep going back & forth):

  • domain to host the full domain name. E.g. www.example.com
    • This is straightforward in cases where you don't need to break it down (e.g. your own web logs)
  • registered_domain to host the part without the subdomain. E.g. example.com
    • I foresee this to be done only when analyzing traffic to the outside, to create bigger destination buckets (e.g. everything going to example.com regardless of subdomain), and may not always be populated in simple cases (parsing one's own web logs). So I'm ok with this being the longer name.

@webmat
Contributor

webmat commented Oct 26, 2018

@ruflin @MikePaquette @andrewkroh @robgil I'd like another round of opinions on the two field names around domain, when you have a moment. See my comments from a few days ago on why I no longer favor fqdn: #84 (comment) (note at the very end of that comment)

@ruflin
Contributor

ruflin commented Oct 30, 2018

I think in the end we should use what is most intuitive for the users. Very few people will read this thread to figure out what to put into these fields. My current take:

  • domain: elastic.co
  • fqdn: www.elastic.co
  • hostname: what.ever.people.felt.like.com

For subdomain I don't think it should be part of ECS for now.

@webmat
Contributor

webmat commented Oct 30, 2018

So for a DNS request to get the IP of a domain, you'd put this in fqdn? :-)

@ruflin
Contributor

ruflin commented Oct 31, 2018

I'm more thinking DNS related stuff should go into its own prefix: #10

@MikePaquette
Contributor

@webmat With ECS 1.1, we added the dns.* field set, which contains a related field, dns.question.registered_domain. For consistency, I would vote for using registered_domain here per your comment #84 (comment)

@mbudge
Contributor

mbudge commented Aug 26, 2019

I propose having a parent_domain field to store the parent domain.

foo.example.com is a sub/child-domain and its parent domain is example.com

"The original “base” zone is referred to as the parent zone, e.g. domain.com; the separated subdomain is referred to as child zone or cut node, e.g. sub.domain.com. For more technical detail, please see RFC 1035."

https://help.dyn.com/child-and-parent-zones-in-dynect/

"In general, subdomains are domains subordinate to their parent domain"
https://en.wikipedia.org/wiki/Domain_name

"Child: "The entity on record that has the delegation of the domain
from the Parent." (Quoted from [RFC7344], Section 1.1)

Parent: "The domain in which the Child is registered." (Quoted from
[RFC7344], Section 1.1) Earlier, "parent name server" was defined
in [RFC0882] as "the name server that has authority over the place
in the domain name space that will hold the new domain". (Note
that [RFC0882] was obsoleted by [RFC1034] and [RFC1035].)
[RFC819] also has some description of the relationship between
parents and children"

https://tools.ietf.org/html/rfc8499

Having a domain and hostname field could get confusing for users who don't know dns.

Having a sub-domain field seems overkill as it's not really useful. Users can search for all logs to a parent domain where the domain and parent domain don't match to get the sub-domains.

@webmat
Contributor

webmat commented Aug 26, 2019

@mbudge Thanks for the input! By the way this is a very old issue. We didn't end up going with hostname to describe domains in ECS.

parent_domain sounds like a good option. I think there can be multiple levels of parent domains, however, correct? For example the parent domain of sub2.sub1.example.com would be sub1.example.com, correct?

With registered_domain, we're going for the very top parent, before getting to the TLD. It's not a perfect name, but it's the best fit we found. It's already in the DNS field set, and will be added soon to other places that have domains in ECS.
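
To make the "multiple levels of parents" point concrete, here is a small Python sketch with a hypothetical helper, `parent_domains`. It deliberately assumes a single-label TLD such as .com; for names under multi-label suffixes like .co.uk, the cut-off would need the public suffix list.

```python
def parent_domains(name):
    """List each parent of name, nearest first, stopping before the bare TLD.

    Assumes the TLD is a single label (e.g. "com").
    """
    labels = name.split(".")
    return [".".join(labels[i:]) for i in range(1, len(labels) - 1)]
```

For `sub2.sub1.example.com` this lists `sub1.example.com` and then `example.com`. Under that assumption the last entry is the value registered_domain is meant to hold, whereas a single parent_domain field would be ambiguous about which level it refers to.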
