Rework Airflow page to include guidance on Kubernetes without kedro-airflow-k8s
#4499
Comments
I think it makes sense to guide our users on how to run Kedro projects on Airflow with execution in separate containers on Kubernetes. Currently, the only guide available was the now-deprecated `kedro-airflow-k8s` plugin. It seems worthwhile to do a small spike to explore the best approach. We might need to modify the `kedro-airflow-k8s` guide or mark it as deprecated. This will become even more important when we implement node grouping in Airflow (issue #962), as it will make running these groups in separate environments more convenient.

It's a good point. I'll reframe the scope of this issue.
I successfully executed the original Kedro-Airflow DAG.

**Original Kedro-Airflow DAG**

The original Airflow DAG generated using `kedro-airflow`:

```python
from __future__ import annotations
from datetime import datetime, timedelta
from pathlib import Path
from airflow import DAG
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project

class KedroOperator(BaseOperator):
    @apply_defaults
    def __init__(
        self,
        package_name: str,
        pipeline_name: str,
        node_name: str | list[str],
        project_path: str | Path,
        env: str,
        conf_source: str,
        *args, **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env
        self.conf_source = conf_source

    def execute(self, context):
        configure_project(self.package_name)
        with KedroSession.create(self.project_path, env=self.env, conf_source=self.conf_source) as session:
            if isinstance(self.node_name, str):
                self.node_name = [self.node_name]
            session.run(self.pipeline_name, node_names=self.node_name)

# Kedro settings required to run your pipeline
env = "local"
pipeline_name = "__default__"
project_path = Path.cwd()
package_name = "sf_pandas"
conf_source = "" or Path.cwd() / "conf"
# Using a DAG context manager, you don't have to specify the dag property of each task
with DAG(
    dag_id="sf-pandas",
    start_date=datetime(2023, 1, 1),
    max_active_runs=3,
    # https://airflow.apache.org/docs/stable/scheduler.html#dag-runs
    schedule_interval="@once",
    catchup=False,
    # Default settings applied to all tasks
    default_args=dict(
        owner="airflow",
        depends_on_past=False,
        email_on_failure=False,
        email_on_retry=False,
        retries=1,
        retry_delay=timedelta(minutes=5)
    )
) as dag:
    tasks = {
        "preprocess-companies-node": KedroOperator(
            task_id="preprocess-companies-node",
            package_name=package_name,
            pipeline_name=pipeline_name,
            node_name="preprocess_companies_node",
            project_path=project_path,
            env=env,
            conf_source=conf_source,
        ),
        "preprocess-shuttles-node": KedroOperator(
            task_id="preprocess-shuttles-node",
            package_name=package_name,
            pipeline_name=pipeline_name,
            node_name="preprocess_shuttles_node",
            project_path=project_path,
            env=env,
            conf_source=conf_source,
        ),
        "create-model-input-table-node": KedroOperator(
            task_id="create-model-input-table-node",
            package_name=package_name,
            pipeline_name=pipeline_name,
            node_name="create_model_input_table_node",
            project_path=project_path,
            env=env,
            conf_source=conf_source,
        ),
        "split-data-node": KedroOperator(
            task_id="split-data-node",
            package_name=package_name,
            pipeline_name=pipeline_name,
            node_name="split_data_node",
            project_path=project_path,
            env=env,
            conf_source=conf_source,
        ),
        "train-model-node": KedroOperator(
            task_id="train-model-node",
            package_name=package_name,
            pipeline_name=pipeline_name,
            node_name="train_model_node",
            project_path=project_path,
            env=env,
            conf_source=conf_source,
        ),
        "evaluate-model-node": KedroOperator(
            task_id="evaluate-model-node",
            package_name=package_name,
            pipeline_name=pipeline_name,
            node_name="evaluate_model_node",
            project_path=project_path,
            env=env,
            conf_source=conf_source,
        ),
    }

    tasks["preprocess-companies-node"] >> tasks["create-model-input-table-node"]
    tasks["preprocess-shuttles-node"] >> tasks["create-model-input-table-node"]
    tasks["create-model-input-table-node"] >> tasks["split-data-node"]
    tasks["split-data-node"] >> tasks["evaluate-model-node"]
    tasks["split-data-node"] >> tasks["train-model-node"]
    tasks["train-model-node"] >> tasks["evaluate-model-node"]
```

**Improvements & Proposed Changes**
If I understand correctly, the implicit grouping strategy here is by pipeline, correct, @DimedS? This is consistent with what users have been telling us. For the record, I agree with everything you propose 👍🏼
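(A hypothetical illustration for readers of this thread, not necessarily the grouping strategy the team will adopt: because the `KedroOperator` in the generated DAG above accepts a list of node names, several nodes can already be collapsed into a single Airflow task. The task names and the two-group split below are invented, and the snippet assumes the `KedroOperator` class and module-level settings (`package_name`, `pipeline_name`, `project_path`, `env`, `conf_source`) from that DAG file are available.)

```python
from datetime import datetime

from airflow import DAG

# Assumes the KedroOperator class and the module-level settings from the
# generated DAG above are defined in the same file.
with DAG(
    dag_id="sf-pandas-grouped",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    # Two coarse-grained tasks instead of six single-node tasks.
    data_processing = KedroOperator(
        task_id="data-processing",
        package_name=package_name,
        pipeline_name=pipeline_name,
        node_name=[
            "preprocess_companies_node",
            "preprocess_shuttles_node",
            "create_model_input_table_node",
        ],
        project_path=project_path,
        env=env,
        conf_source=conf_source,
    )
    data_science = KedroOperator(
        task_id="data-science",
        package_name=package_name,
        pipeline_name=pipeline_name,
        node_name=[
            "split_data_node",
            "train_model_node",
            "evaluate_model_node",
        ],
        project_path=project_path,
        env=env,
        conf_source=conf_source,
    )

    # Datasets that cross the task boundary (here the model input table) still
    # need to be persisted in the catalog rather than kept in memory.
    data_processing >> data_science
```

Grouping strictly by registered pipeline would look much the same, with one task per pipeline name; either way this is only a sketch of the idea raised above.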
This is cool @DimedS :D
Thanks, @astrojuanlu! I didn't modify the grouping strategies in this PR; that should be addressed in future PRs.

Thanks, @ankatiyar!
I think that sounds great! Kedro should remain the default option, as it is now.
Based on my experience with deployment plugins, it's best to keep them simple. Instead of adding complexity, I think it's better to document the process, explaining that users can easily generate a Docker image for their project using `kedro-docker`.
I created PR #4529 to close the current issue and update the documentation on manually modifying DAGs. As a follow-up, I opened issue #1025 to enhance the plugin. If we have the resources, we can start working on this after completing the current sprint changes.
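To make the container-based option concrete for readers of this issue: the generated DAG above runs every node through a `KedroSession` inside the Airflow worker process itself. A minimal, hypothetical sketch of the alternative discussed here — each node (or node group) running in its own Kubernetes pod — could use Airflow's `KubernetesPodOperator` together with a project image built by `kedro-docker` (`kedro docker init` / `kedro docker build`). The image name, namespace, and task subset below are invented for illustration, the provider import path and the `kedro run --nodes` flag vary between Airflow provider and Kedro versions, and this is not necessarily what PR #4529 documents.

```python
from datetime import datetime

from airflow import DAG
# Import path assumes a recent apache-airflow-providers-cncf-kubernetes release.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Assumptions: "sf-pandas:latest" was built with kedro-docker and pushed to a
# registry the cluster can pull from; the "data-pipelines" namespace exists.
IMAGE = "sf-pandas:latest"
NAMESPACE = "data-pipelines"


def kedro_pod_task(task_id: str, node_name: str) -> KubernetesPodOperator:
    """Run a single Kedro node in its own pod via the packaged project's CLI."""
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id,
        namespace=NAMESPACE,
        image=IMAGE,
        cmds=["kedro"],
        arguments=["run", "--nodes", node_name],
        get_logs=True,
    )


with DAG(
    dag_id="sf-pandas-k8s",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    # Only the first three tasks are shown; datasets crossing pod boundaries
    # must be persisted in the catalog (e.g. on object storage), not in memory.
    preprocess_companies = kedro_pod_task(
        "preprocess-companies-node", "preprocess_companies_node"
    )
    preprocess_shuttles = kedro_pod_task(
        "preprocess-shuttles-node", "preprocess_shuttles_node"
    )
    create_table = kedro_pod_task(
        "create-model-input-table-node", "create_model_input_table_node"
    )

    [preprocess_companies, preprocess_shuttles] >> create_table
```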
Description
The maintainers have mentioned that they'd rather move users towards `kedro-airflow`.
Documentation page (if applicable)
Context