extract: Detect location of whosonfirst data from pelias.json #235

Open: orangejulius wants to merge 1 commit into master from detect-wof-data-location

@orangejulius orangejulius commented Feb 12, 2025

This change allows the Placeholder extract script to work in most cases without specifying the WOF_DIR environment variable.

Previously, unless you were using the particular arrangement of files and directories from pelias/docker, the default location the extract script looks for data (/data/whosonfirst-data/sqlite) was probably not correct.

I noticed this inconvenience when running Pelias locally without docker for the first time in quite a long time.

My guess/recollection is an older version of the extract script (pre-sqlite) was pure bash, and so checking pelias.json was less convenient than in the current Node.js script.

The WOF_DIR environment variable is left as an override, but my hope is with this change almost no one would have to use it.

```diff
@@ -9,7 +9,9 @@ const combinedStream = require('combined-stream');

 const SQLITE_REGEX = /whosonfirst-data-[a-z0-9-]+\.db$/;

-const WOF_DIR = process.env.WOF_DIR || '/data/whosonfirst-data/sqlite';
+const WOF_DIR = process.env.WOF_DIR || // TODO: deprecate WOF_DIR env var after some time
+  config.whosonfirst.datapath || // Preferred method of finding WOF data is to autodetect from pelias.json
```

With optional chaining (ES2020) this can be written as config?.whosonfirst?.datapath, which avoids a fatal error if the parent path doesn't exist.

Likely not an issue in practice, as we have a default value.
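
A minimal sketch of the difference, for illustration only. The config objects and the `detectDatapath` helper are hypothetical and not part of the PR; the fallback value is the old default from the diff.

```javascript
// Hypothetical config objects for illustration; not the real pelias.json contents.
const fullConfig = { whosonfirst: { datapath: '/data/whosonfirst' } };
const emptyConfig = {};

// The old default location from the diff above.
const DEFAULT_DIR = '/data/whosonfirst-data/sqlite';

// Optional chaining yields undefined instead of throwing when a parent key
// (or the whole config object) is missing, so the fallback can take over.
function detectDatapath(config) {
  return config?.whosonfirst?.datapath || DEFAULT_DIR;
}

console.log(detectDatapath(fullConfig));
console.log(detectDatapath(emptyConfig));
console.log(detectDatapath(undefined));
```

With plain property access, `emptyConfig.whosonfirst.datapath` would throw a `TypeError` before the `||` fallback is ever evaluated.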


@missinglink missinglink left a comment


Looks good to me. These paths are always kinda confusing, so standardising them is a nice idea; the one I linked from the defaults is different again 🤷‍♂️

@orangejulius orangejulius force-pushed the detect-wof-data-location branch from dadea4f to 20bda96 Compare March 11, 2025 17:45
@orangejulius
Member Author

I just updated this with one minor change (adjust a comment where I originally proposed we eventually remove the WOF_DIR environment variable, when instead it probably should be left around just in case).

Also, reading through some Windows-related comments like this one, it seems like this change might help some Windows users.

@orangejulius orangejulius force-pushed the detect-wof-data-location branch 2 times, most recently from bc72c16 to 9ad247a Compare March 11, 2025 17:57
@orangejulius
Member Author

Okay, I've updated this yet again to fix two issues with the path detection:

  • I was using config.whosonfirst.datapath when config already pointed to the whosonfirst specific config section
  • The whosonfirst importer requires that you specify a datapath which contains a subdirectory called sqlite that contains all the .db files. On the other hand, this repo requires that you specify the full path. As a hacky fix I'm now using config.datapath + '/sqlite'. But I don't like this at all.

The sqlite dir convention comes from years ago when we were migrating from the old meta distribution files. At this point what I would love is to change the behavior of both whosonfirst and placeholder so that they find any .db files within the datapath dir, recursively. @missinglink what do you think?
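
To make the recursive idea concrete, here is a rough sketch, not code from either repository. `findWofDatabases` is a hypothetical helper name; the filename pattern is the `SQLITE_REGEX` from the diff, which also keeps unrelated `.db` files out of the result.

```javascript
const fs = require('fs');
const path = require('path');

// Same filename pattern as the extract script's SQLITE_REGEX.
const SQLITE_REGEX = /whosonfirst-data-[a-z0-9-]+\.db$/;

// Hypothetical helper: walk `dir` recursively and collect matching .db files.
// Symlinks are not followed, since isFile()/isDirectory() are false for them.
function findWofDatabases(dir) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findWofDatabases(fullPath));
    } else if (entry.isFile() && SQLITE_REGEX.test(entry.name)) {
      found.push(fullPath);
    }
  }
  // Sort for a stable order regardless of filesystem listing order.
  return found.sort();
}
```

Called with the configured datapath, this would return the matching files whether or not they sit in a `sqlite` subdirectory.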

@missinglink
Member

> At this point what I would love is to change the behavior of both whosonfirst and placeholder so that they find any .db files within the datapath dir, recursively.

Yeah I agree that we're very unlikely to go back to using the old geojson 'bundles'.

One thing to be careful of is that the spatial extracts (once decompressed) also share the suffix .db as they are also SQLite files.

For now I think it's probably fine to leave the 'sqlite' directory. It allows us some flexibility in the future to write other formats in the same parent directory tree (such as spatial and parquet), and it means we don't have to change anything and risk breaking things.

As an aside, it's probably more robust to use path.join rather than doing string concatenation of the subdir.
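
A small illustration of why, assuming a datapath that may or may not end in a slash. The `concatSqlite` and `joinSqlite` helpers and the sample paths are made up for this example; `path.posix` is used only so the output is the same on every platform.

```javascript
const path = require('path');

// Hypothetical helpers contrasting the two approaches.
const concatSqlite = (datapath) => datapath + '/sqlite';
const joinSqlite = (datapath) => path.posix.join(datapath, 'sqlite');

// A trailing slash is an easy variation to end up with in a config file.
for (const datapath of ['/data/whosonfirst', '/data/whosonfirst/']) {
  console.log(`${concatSqlite(datapath)}  vs  ${joinSqlite(datapath)}`);
}
```

String concatenation keeps the doubled separator for the second input, while `path.join` normalises it away.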

@missinglink
Member

missinglink commented Mar 12, 2025

I'm trying to understand this better...

It looks like the Dockerfile sets WOF_DIR to /data/whosonfirst/sqlite, which corresponds to ${DATA_DIR}/whosonfirst/sqlite on the host; I think this is correct.

I don't see how/where /data/whosonfirst-data/sqlite could be correct, so I think we should just delete that.

So to run it locally, it will fail because WOF_DIR is not set and /data/whosonfirst-data/sqlite presumably doesn't exist.

In that case, I think your method is good, it can probably be simplified to:

```javascript
const path = require('path');

// Use WOF_DIR env variable when available, otherwise use the location specified in pelias.json
const WOF_DIR = process.env.WOF_DIR || path.join(config.datapath, 'sqlite');
```
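
The same precedence can be sketched as a standalone function for experimenting outside the extract script. `resolveWofDir` is a hypothetical name, and `config` here stands for the whosonfirst section of pelias.json, which is an assumption based on the comments above.

```javascript
const path = require('path');

// Hypothetical helper: `env` and `config` are passed in explicitly so the
// precedence can be exercised without touching process.env or pelias.json.
function resolveWofDir(env, config) {
  // An explicit WOF_DIR wins; it must point at the directory holding the .db files.
  if (env.WOF_DIR) {
    return env.WOF_DIR;
  }
  // Otherwise derive the sqlite directory from the configured datapath.
  return path.join(config.datapath, 'sqlite');
}

console.log(resolveWofDir({}, { datapath: '/data/whosonfirst' }));
console.log(resolveWofDir({ WOF_DIR: '/custom/sqlite' }, { datapath: '/data/whosonfirst' }));
```

Note the asymmetry this makes visible: the environment variable names the sqlite directory itself, while the config value names its parent.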

@missinglink
Member

In fact the WOF_DIR variable is a misnomer, it's not pointing to the whosonfirst parent directory as you might assume, but in fact the sqlite dir, so it should really be named WOF_SQLITE_DIR.

I don't think the original intent was to make this a user-configurable ENV var, although it ended up that way. It was simply a way to tell the script to look in /data, which is our convention for all data inside a container.

I seriously doubt anyone is setting WOF_DIR since 99.9% of users are running our docker containers via pelias/docker, which doesn't use it.

Looking at the configs, the imports.whosonfirst.datapath variable points to /data/whosonfirst anyway, so I seriously doubt WOF_DIR is even required; it can probably be safely removed.

The exception is the default config, where everything points to /mnt/pelias/. This is weird and we should fix it independently; thankfully, all the docker 'projects' override imports.whosonfirst.datapath.

I guess that's all a roundabout way of saying we can probably delete WOF_DIR completely and simply use: const WOF_DIR = path.join(config.datapath, 'sqlite')

@missinglink
Member

Looking over the whole org for mentions of WOF_DIR I only found one issue mentioning it, and it seems the user actually misconfigured it, potentially causing the issue which they reported.

[Image: Screenshot 2025-03-12 at 13 42 02]

https://github.com/search?q=org%3Apelias+WOF_DIR&type=issues

@missinglink
Member

The same applies to PLACEHOLDER_DATA, for all the same reasons: it should always point to /data/placeholder in Docker and is currently not configurable via pelias.json. The code uses relative paths throughout, which is kinda odd but seems to work.

@orangejulius
Member Author

> I don't see how/where /data/whosonfirst-data/sqlite could be correct, I think we just delete that.

I mean, all default path locations are sometimes wrong. This one is probably more often wrong though, so I agree. Also, we have a default whosonfirst datapath set in pelias.json so I don't think that third line could ever be reached anyway.

Your code snippet is the right way to go, I'll update the PR.

Some followup thoughts/questions that are more minor though:

  • I think having a configurable WOF_DIR is still useful for edge cases. For example, people running Placeholder outside a context with the rest of Pelias, or if they have to do something funky with paths on Windows or whatever
  • Should we even set a default WOF_DIR in the Dockerfile? The Dockerfile itself doesn't set up how any data is mounted into the container that will eventually run, so it's kinda useless. In a docker-compose.yml file though, it could make sense.

@orangejulius orangejulius force-pushed the detect-wof-data-location branch from 9ad247a to 57cbdda Compare March 12, 2025 13:10
@missinglink
Member

> I think having a configurable WOF_DIR is still useful for edge cases

I'm fine leaving it in if you think it's useful. Of course, someone could always do something like this, although agreed it's not very intuitive:

```
jq -n '.imports.whosonfirst.datapath="/foo"'
{
  "imports": {
    "whosonfirst": {
      "datapath": "/foo"
    }
  }
}
```
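
For reference, a rough sketch of how a config like the one jq prints would map onto the sqlite directory. It assumes the script only sees the whosonfirst section, as discussed earlier in the thread; `/foo` is the placeholder path from the jq example, and `path.posix` keeps the result platform-independent.

```javascript
const path = require('path');

// The same document the jq command prints; '/foo' is a placeholder path.
const peliasJson = JSON.parse('{"imports":{"whosonfirst":{"datapath":"/foo"}}}');

// Assumption: the extract script works with the whosonfirst section only.
const config = peliasJson.imports.whosonfirst;

// The sqlite subdirectory convention is applied on top of the datapath.
const wofDir = path.posix.join(config.datapath, 'sqlite');
console.log(wofDir); // /foo/sqlite
```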

> Should we even set a default WOF_DIR in the Dockerfile?

Probably not, like you mentioned, it doesn't really make sense to configure something during docker build which might end up being invalid with docker run.
