Change structure of URL
So far the URL structure was heavily inspired by whatwg/url#337. I initially only wanted to make some tweaks to it to improve querying, but I realised I had never felt fully comfortable with the field names used there. So I started to look at the URL parsers of different languages like Go, Ruby and Python, and the output they provide is surprisingly similar across languages but not consistent with whatwg. The change made here brings the field names closer to what most URL parsers output.
ruflin committed May 30, 2018
1 parent ca691df commit 853bb75
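
The commit message's point about parser output can be checked against one of the parsers it names. The following is a minimal sketch, assuming Python's standard `urllib.parse.urlsplit` as the reference parser; the `fields` dictionary and its `url.*` keys are only an illustration of how the parser's attributes line up with the renamed fields, not code from this repository.

```python
from urllib.parse import urlsplit

# Example URL taken from the `url.href` example in this commit.
href = "https://elastic.co:443/search?q=elasticsearch#top"
parts = urlsplit(href)

# Illustrative mapping (an assumption, not part of the commit):
# parser attribute names on the right, renamed schema fields on the left.
fields = {
    "url.href": href,
    "url.scheme": parts.scheme,      # no trailing ":"
    "url.hostname": parts.hostname,
    "url.port": parts.port,          # an integer, matching the `long` type
    "url.path": parts.path,
    "url.query": parts.query,        # no leading "?"
    "url.fragment": parts.fragment,  # no leading "#"
    "url.username": parts.username,  # None when absent
    "url.password": parts.password,  # None when absent
}

for name, value in fields.items():
    print(f"{name}: {value!r}")
```

Note that `urlsplit` returns an empty string for `query` both when the `?` is missing and when it is present with nothing after it, so it cannot by itself express the distinction that the new `url.query` description asks for.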
Showing 4 changed files with 51 additions and 59 deletions.
25 changes: 11 additions & 14 deletions README.md
@@ -342,30 +342,27 @@ Source fields describe details about the source of where the event is coming fro

## <a name="url"></a> URL fields

A complete URL, with scheme, host, and path.
A complete URL, with scheme, host and path.

The URL object can be reused in other prefixes like `host.url.*` for example. It is important that whenever URL is used that the same structure is used.

`url.href` is a [multi field](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/multi-fields.html#_multi_fields_with_multiple_analyzers) which means the data is stored as keyword `url.href` and test `url.href.analyzed`. The advantage of this is that for running a query against only a part of the url still works without having to split up the URL in all its part on ingest time.

Based on whatwg URL definition: https://github.com/whatwg/url/issues/337
`url.href` is a [multi field](https://www.elastic.co/guide/en/ elasticsearch/reference/6.2/ multi-fields.html#_multi_fields_with_multiple_analyzers) which means the data is stored as keyword `url.href` and test `url.href.analyzed`. The advantage of this is that for running a query against only a part of the url still works without having to split up the URL in all its part on ingest time.


| Field | Description | Type | Multi Field | Example |
|---|---|---|---|---|
| <a name="url.href"></a>`url.href` | href contains the full url. The field is stored as keyword.<br/>`href` is an analyzed field so the parsed information can be accessed through `href.analyzed` in queries. | keyword | | `https://elastic.co:443/search?q=elasticsearch#top` |
| <a name="url.href"></a>`url.href` | href contains the full url. The field is stored as keyword.<br/>`href` is an analyzed field so the parsed information can be accessed through `href.analyzed` in quries. | keyword | | `https://elastic.co:443/search?q=elasticsearch#top` |
| <a name="url.href.analyzed"></a>`url.href.analyzed` | | text | 1 | |
| <a name="url.protocol"></a>`url.protocol` | The protocol of the request, e.g. "https:". | keyword | | |
| <a name="url.hostname"></a>`url.hostname` | The hostname of the request, e.g. "example.com".<br/>For correlation the this field can be copied into the `host.name` field. | keyword | | |
| <a name="url.port"></a>`url.port` | The port of the request, e.g. 443. | keyword | | |
| <a name="url.pathname"></a>`url.pathname` | The path of the request, e.g. "/search". | text | | |
| <a name="url.pathname.raw"></a>`url.pathname.raw` | The url path. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
| <a name="url.search"></a>`url.search` | The search describes the query string of the request, e.g. "q=elasticsearch". | text | | |
| <a name="url.search.raw"></a>`url.search.raw` | The url search part. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
| <a name="url.hash"></a>`url.hash` | The hash of the request URL, e.g. "top". | keyword | | |
| <a name="url.scheme"></a>`url.scheme` | The scheme of the request, e.g. "https".<br/>Note: The `:` is not part of the scheme. | keyword | | `https` |
| <a name="url.hostname"></a>`url.hostname` | The hostname of the request, e.g. "example.com".<br/>For correlation the this field can be copied into the `host.name` field. | keyword | | `elastic.co` |
| <a name="url.port"></a>`url.port` | The port of the request, e.g. 443. | long | | `443` |
| <a name="url.path"></a>`url.path` | The path of the request, e.g. "/search". | text | | |
| <a name="url.path.raw"></a>`url.path.raw` | The url path. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
| <a name="url.query"></a>`url.query` | The search describes the query string of the request, e.g. "q=elasticsearch".<br/>The `?` is excluded from the query string. In case an URL contains no `?` it is expected that the query field is left out. In case there is a `?` but no query, the query field is expected to exist with an empty string. Like this the `exists` query can be used to differentiate between the two cases. | text | | |
| <a name="url.query.raw"></a>`url.query.raw` | The url query part. This is a non-analyzed field that is useful for aggregations. | keyword | 1 | |
| <a name="url.fragment"></a>`url.fragment` | The part of the url after the `#`, e.g. "top".<br/>The `#` is not part of the fragment. | keyword | | |
| <a name="url.username"></a>`url.username` | The username of the request. | keyword | | |
| <a name="url.password"></a>`url.password` | The password of the request. | keyword | | |
| <a name="url.extension"></a>`url.extension` | The url extension field contains the extension of the file associated with the url.<br/>A simple example is `http://localhost/logo.png` where the extension would be `png`. There can also be more complex cases like `http://localhost/content?asset=logo.png&token=XYZ` where the extension could also be `png` but depends on the implementation.<br/>The `extension` field should be left out if the extension is not defined. | keyword | | `png` |


## <a name="user"></a> User fields
13 changes: 6 additions & 7 deletions schema.csv
@@ -112,15 +112,14 @@ source.ip,ip,0,
source.mac,keyword,1,
source.port,long,1,
source.subdomain,keyword,1,
url.extension,keyword,0,png
url.hash,keyword,0,
url.hostname,keyword,0,
url.fragment,keyword,0,
url.hostname,keyword,0,elastic.co
url.href,keyword,0,https://elastic.co:443/search?q=elasticsearch#top
url.password,keyword,0,
url.pathname,text,0,
url.port,keyword,0,
url.protocol,keyword,0,
url.search,text,0,
url.path,text,0,
url.port,long,0,443
url.query,text,0,
url.scheme,keyword,0,https
url.username,keyword,0,
user.email,keyword,1,
user.hash,keyword,1,
51 changes: 26 additions & 25 deletions schemas/url.yml
@@ -2,47 +2,52 @@
- name: url
title: URL
description: >
A complete URL, with scheme, host, and path.
A complete URL, with scheme, host and path.
The URL object can be reused in other prefixes like `host.url.*` for
example. It is important that whenever URL is used that the same structure
is used.
`url.href` is a [multi field](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/multi-fields.html#_multi_fields_with_multiple_analyzers)
`url.href` is a [multi field](https://www.elastic.co/guide/en/
elasticsearch/reference/6.2/
multi-fields.html#_multi_fields_with_multiple_analyzers)
which means the data is stored as keyword `url.href` and test
`url.href.analyzed`. The advantage of this is that for running a query
against only a part of the url still works without having to split up the
URL in all its part on ingest time.
Based on whatwg URL definition: https://github.com/whatwg/url/issues/337
fields:
- name: href
type: keyword
description: >
href contains the full url. The field is stored as keyword.
`href` is an analyzed field so the parsed information can be accessed
through `href.analyzed` in queries.
through `href.analyzed` in quries.
multi_fields:
- name: analyzed
type: text
example: https://elastic.co:443/search?q=elasticsearch#top
- name: protocol
- name: scheme
type: keyword
description: >
The protocol of the request, e.g. "https:".
The scheme of the request, e.g. "https".
Note: The `:` is not part of the scheme.
example: https
- name: hostname
type: keyword
description: >
The hostname of the request, e.g. "example.com".
For correlation the this field can be copied into the `host.name`
field.
example: elastic.co
- name: port
type: keyword
type: long
description: >
The port of the request, e.g. 443.
- name: pathname
example: 443
- name: path
type: text
description: >
The path of the request, e.g. "/search".
@@ -52,21 +57,29 @@
description: >
The url path. This is a non-analyzed field that is useful
for aggregations.
- name: search
- name: query
type: text
description: >
The search describes the query string of the request,
e.g. "q=elasticsearch".
The `?` is excluded from the query string. In case an URL
contains no `?` it is expected that the query field is left out.
In case there is a `?` but no query, the query field is expected
to exist with an empty string. Like this the `exists` query can be
used to differentiate between the two cases.
multi_fields:
- name: raw
type: keyword
description: >
The url search part. This is a non-analyzed field that is useful
The url query part. This is a non-analyzed field that is useful
for aggregations.
- name: hash
- name: fragment
type: keyword
description: >
The hash of the request URL, e.g. "top".
The part of the url after the `#`, e.g. "top".
The `#` is not part of the fragment.
- name: username
type: keyword
description: >
@@ -75,15 +88,3 @@
type: keyword
description: >
The password of the request.
- name: extension
type: keyword
description: >
The url extension field contains the extension of the file associated with
the url.
A simple example is `http://localhost/logo.png` where the extension would be `png`.
There can also be more complex cases like `http://localhost/content?asset=logo.png&token=XYZ`
where the extension could also be `png` but depends on the implementation.
The `extension` field should be left out if the extension is not defined.
example: png
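
The `url.query` description above separates three cases: no `?` (field left out), a `?` with nothing after it (field present and empty), and a non-empty query. A minimal sketch of that rule follows; the helper name `url_query_field` is hypothetical and the string-splitting approach is an assumption, not the implementation used by any ingest pipeline for this schema.

```python
def url_query_field(url: str) -> dict:
    """Return {"url.query": ...} only when the URL contains a "?".

    Hypothetical helper illustrating the rule in the `url.query`
    description; it is not part of this repository.
    """
    # Drop the fragment first: a "?" after "#" belongs to the fragment.
    before_fragment = url.split("#", 1)[0]
    if "?" not in before_fragment:
        # No "?" at all: the field is left out of the document.
        return {}
    # "?" present: keep everything after it, even if it is empty.
    return {"url.query": before_fragment.split("?", 1)[1]}


print(url_query_field("https://elastic.co/search"))
print(url_query_field("https://elastic.co/search?"))
print(url_query_field("https://elastic.co/search?q=elasticsearch#top"))
```

With documents built this way, an `exists` query on `url.query` would match the second and third URLs but not the first, which is the differentiation the description refers to.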
21 changes: 8 additions & 13 deletions template.json
@@ -572,11 +572,7 @@
},
"url": {
"properties": {
"extension": {
"ignore_above": 1024,
"type": "keyword"
},
"hash": {
"fragment": {
"ignore_above": 1024,
"type": "keyword"
},
@@ -598,7 +594,7 @@
"ignore_above": 1024,
"type": "keyword"
},
"pathname": {
"path": {
"fields": {
"raw": {
"ignore_above": 1024,
@@ -609,14 +605,9 @@
"type": "text"
},
"port": {
"ignore_above": 1024,
"type": "keyword"
},
"protocol": {
"ignore_above": 1024,
"type": "keyword"
"type": "long"
},
"search": {
"query": {
"fields": {
"raw": {
"ignore_above": 1024,
@@ -626,6 +617,10 @@
"norms": false,
"type": "text"
},
"scheme": {
"ignore_above": 1024,
"type": "keyword"
},
"username": {
"ignore_above": 1024,
"type": "keyword"