-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/sigma): Sigma connector integration #10037
feat(ingest/sigma): Sigma connector integration #10037
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionality:
- seems like none of the workspaces have any entities nested within them - this suggests that the container paths are not set up correctly
- can we generate more details about sigma datasets? e.g. schema, properties, documentation, ownership?
- datasets are part of workspaces/folders right? they should have a container aspect
- what happens to folders within a workspace?
- can we get metadata from "badges" in sigma - those would probably be pretty valuable
- for charts/dashboards, can we generate externalUrl info so we get the "View in Sigma" button in the UI
for workspace_dict in response.json(): | ||
workspaces.append(Workspace.parse_obj(workspace_dict)) | ||
except Exception as e: | ||
self._log_http_error(message=f"Unable to fetch workspaces. Exception: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this happens, it should appear as a "failure" in the source report
) | ||
|
||
def _get_sigma_dataset_identifier(self, dataset: SigmaDataset) -> str: | ||
return f"{dataset.datasetId}".lower() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we making these lowercased?
|
you should emit a browsePathsV2 aspect, where the first parts are dataPlatformInstance (if set) and workspace container urns, and the remaining parts are strings with the folder names
They should become tags in datahub, prefixed with "sigma:" |
a93b39d
to
42321a3
Compare
name: str | ||
description: str | ||
createdBy: str | ||
createdAt: datetime | ||
updatedAt: datetime | ||
url: str | ||
path: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be better to keep this as a List[str]
- that way if someone has a /
in their folder name, we still handle it correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are getting path attribute details from API in string format. If we later convert it to list of string with any one folder containing /
in name, still that folder name will get split.
and "*" in self.config.chart_sources_platform_mapping | ||
): | ||
data_source_platform_details = self.config.chart_sources_platform_mapping[ | ||
"*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the *
is basically a "fallback" platform detail if nothing else matches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user wants to provide platform details to all sources used by all charts which got ingested, *
as key can be used.
inputs[ | ||
self._gen_sigma_dataset_urn(source_id) | ||
] = in_table_urn | ||
sql_parser_in_tables.remove(in_table_urn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll probably need to come back to this to generate CLL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To generate CLL, we don't have schema metadata for datasets. We do have columns details used by elements/charts, but can't map that columns with there tables.
self.users: Dict = {} | ||
self.workspaces: Dict[str, Workspace] = {} | ||
self.users: Dict[str, str] = {} | ||
self.datasets: Dict[str, str] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we saving these? I would've expected these to be saved in sigma.py, not in sigma_api.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed self.datasets
but others can't be moved or removed. We even can't use functools.lru_cache
as it gives problem with function fetching response from GET API.
``` | ||
|
||
#### Example - For all ingested charts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow up - the order of these examples should be reversed, and they should be rephrased as below
#### Example - For all ingested charts | |
#### Example - All workbooks use the same connection |
@@ -22,7 +22,6 @@ def __init__(self, config: SigmaSourceConfig) -> None: | |||
self.config = config | |||
self.workspaces: Dict[str, Workspace] = {} | |||
self.users: Dict[str, str] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my comment was about users and workspaces too - those should probably not be getting saved here?
Co-authored-by: Harshal Sheth <[email protected]>
Checklist