This post is a record that uses screenshots and execution results to verify that the GitOps-based E2E ML Platform is structurally separated, actually runs, and supports operational observability and recovery.


🧭 Table of Contents

#  Section
0  Preparation
1  GitOps Boundary Verification
2  Core-only Verification
3  Optional ON Verification
4  Airflow E2E DAG Execution Verification
5  Serving Runtime Verification
6  Observability Verification
7  Operations Docs / Proof Structure Verification
8  Final Conclusion

0) 🧰 Preparation

This verification was carried out against the following criteria.

  • The GitOps structure is checked against ArgoCD Project / Application / Root App
  • The Optional layer is checked through ON / OFF execution results
  • The Airflow pipeline is checked through the execution result of the E2E DAG orchestration
  • Serving state is checked through Triton READY / FastAPI health / reload / metrics
  • Operational observability is checked through Prometheus / Alertmanager / ServiceMonitor / PrometheusRule
  • All results are organized as the latest snapshot under docs/proof/latest/

Main paths used:

docs/proof/latest/projects.txt
docs/proof/latest/apps.txt
docs/proof/latest/root-apps.txt
docs/proof/latest/core_only/*
docs/proof/latest/optional_on/*
docs/proof/latest/e2e_success/*
docs/proof/latest/observability/*

1) 🧭 GitOps Boundary Verification

1-1) ArgoCD Project Boundary Verification

📸 proof-01-projects.png

✅ Checkpoints:

  • AppProjects are separated into baseline, bootstrap, dev, optional, and prod; project(default) exists only as the built-in project

  • dev targets the -dev namespace scope and prod targets the -prod namespace scope

  • optional is split off as a layer dedicated to feature-store / feast

  • Each AppProject uses the GitOps repository (mlops-infra-gitops) as its main source, and some baseline components also use Helm chart repositories

🧩 Meaning:

This proves that deployment boundaries are enforced at the ArgoCD AppProject level, not merely distinguished by namespace names. In other words, this is a GitOps environment in which dev / prod / optional / baseline are structurally separated.
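
To make the idea of a project-level boundary concrete, here is a minimal Python sketch of the kind of destination check an AppProject performs. The project names dev and prod come from the checkpoints above; the glob patterns (`*-dev`, `*-prod`), the dictionary, and the function name are assumptions for illustration and are not copied from the actual AppProject specs in the repository.

```python
from fnmatch import fnmatch

# Hypothetical destination patterns per AppProject (assumed, not taken from the repo).
PROJECT_DESTINATIONS = {
    "dev": ["*-dev"],
    "prod": ["*-prod"],
}


def namespace_allowed(project: str, namespace: str) -> bool:
    """Return True if the namespace matches one of the project's destination globs."""
    patterns = PROJECT_DESTINATIONS.get(project, [])
    return any(fnmatch(namespace, pattern) for pattern in patterns)
```

Under these assumed patterns, `namespace_allowed("dev", "airflow-dev")` is accepted while `namespace_allowed("dev", "airflow-prod")` is rejected, which is the behavior the project boundary is meant to guarantee.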


1-2) ArgoCD Application Overall Structure Verification

📸 proof-02-01-apps-list.png

📸 proof-02-02-apps-list.png

✅ Checkpoints:

  • Core: airflow-dev/prod, mlflow-dev/prod, fastapi-dev/prod, triton-dev/prod
  • Baseline: alloy-*, loki-*, minio-*, monitoring-*
  • Optional: feast-dev/prod, optional-envs-dev/prod
  • Most apps are in the Synced / Healthy state
  • The per-layer ApplicationSets are managed through the Root Applications (root-apps / root-baseline / root-optional)

🧩 Meaning:

The Core / Baseline / Optional split is not just a distinction made in the documentation; the actual ArgoCD Application list itself reflects the layer structure.


1-3) Root Application Structure Verification

📸 proof-03-root-apps.png

✅ Checkpoints:

  • root-apps runs under the bootstrap project
  • It manages AppProject, Application, and ApplicationSet resources together
  • The mlops-core ApplicationSet is confirmed to exist

🧩 Meaning:

This proves that the platform is not a handful of manually registered apps but has a GitOps operating structure that flows Root App → Project → AppSet → App.


2) 🧱 Core-only Verification

2-1) Optional OFF Execution Verification

📸 proof-04-01-optional-off-run.png

📸 proof-04-02-optional-off-run.png

📸 proof-04-03-optional-off-run.png

✅ Checkpoints:

  • feast-dev, feast-prod, optional-envs-dev, optional-envs-prod, and root-optional are deleted

  • After waiting, the message confirming that the optional scope apps were removed is shown

  • Optional layer ON/OFF is managed by Makefile-based operations scripts

  • The Root Application manages AppProjects and ApplicationSets together, and Core layer applications are generated through the ApplicationSet (mlops-core)

🧩 Meaning:

This proves that Optional OFF is not just a concept but an actual operation that detaches the Optional layer at the ArgoCD app level.


2-2) Optional App Removal Verification

📸 proof-05-optional-scope-zero.png

✅ Checkpoints:

  • argocd app list | grep optional and grep feast no longer return any Optional-related apps
  • Resources inside the feast-dev and feast-prod namespaces are also removed

🧩 Meaning:

The apps and main resources of the Optional layer have been removed, which shows that a Core-only state, with only Core/Baseline remaining, was actually reached.
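
The "scope zero" check above amounts to filtering the application list for Optional-related names and expecting nothing back. A minimal Python sketch of that filter follows; the marker strings mirror the two grep patterns (optional, feast), while the function name and the idea of passing app names in as a plain list are assumptions for illustration.

```python
# Substrings that mark an app as part of the Optional layer (same as the grep patterns).
OPTIONAL_MARKERS = ("optional", "feast")


def remaining_optional_apps(app_names):
    """Return the app names that still belong to the Optional layer."""
    return [
        name
        for name in app_names
        if any(marker in name for marker in OPTIONAL_MARKERS)
    ]
```

An empty result corresponds to the Core-only state; any returned name means the detach has not finished yet.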


2-3) Core-only Health Verification

📸 proof-06-core-health-probes-fastapi.png

✅ Checkpoints:

  • dev endpoint (https://fastapi.local/health) → status=ok
  • prod endpoint (https://fastapi.prod/health) → status=ok

🧩 Meaning:

Even after the Optional features are removed, the FastAPI-based core serving endpoints remain healthy.
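
A health probe like the one above can be sketched in Python as follows. This is a hedged illustration: it assumes the /health endpoint returns a JSON object such as `{"status": "ok"}` (the source only shows status=ok), and the function names are hypothetical.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True if the response body is a JSON object with status == "ok" (assumed shape)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def probe(url: str, timeout: float = 5.0) -> bool:
    """Fetch the URL and check both the HTTP status and the body.

    urlopen raises on network, TLS, or HTTP errors, so callers may want to catch those.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200 and is_healthy(resp.read().decode("utf-8"))
```

Calling `probe("https://fastapi.local/health")` and `probe("https://fastapi.prod/health")` would reproduce the two checkpoints, provided the endpoints are reachable and their certificates are trusted by the client.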


2-4) Feature Store Namespace Non-destructive Retention Verification

📸 proof-07-feature-store-ns-retained.png

✅ Checkpoints:

  • The feature-store-dev and feature-store-prod namespaces remain in the Active state
  • According to kubectl get all, the workload resources inside the namespaces have been removed

🧩 Meaning:

This proves that Optional OFF is designed as a non-destructive detach rather than a namespace deletion. As a result, re-attachment is stable and the operational boundary is preserved.


3) 🔌 Optional ON Verification

3-1) Optional ON Execution Verification

📸 proof-08-optional-on-run.png

✅ Checkpoints:

  • optional-envs-dev, optional-envs-prod, feast-dev, and feast-prod sync successfully
  • The whole ON process completes normally

🧩 Meaning:

This proves that the Optional layer is an attachable extension that can be re-attached whenever it is needed.


3-2) Optional App Restoration Verification

📸 proof-09-optional-scope-restored.png

✅ Checkpoints:

  • feast-dev, feast-prod, optional-envs-dev, optional-envs-prod, and root-optional are created
  • Their state is restored to Synced / Healthy

🧩 Meaning:

This proves that the Optional features are not baked into the core but form an independent layer that can be safely attached / detached when needed.


4) 🔄 Airflow E2E DAG Execution Verification

4-1) E2E DAG Full Success Flow Verification

📸 proof-09-01-e2e-dag-success.png

📸 proof-09-02-e2e-dag-success.png

✅ Checkpoints:

  • In the e2e_full DAG, the flow dp.extract_raw_data → dp.validate_data → dp.build_features → dp.store_features → drift_gate → train_and_evaluate → check_result → register_model_task → check_model_ready → deploy → commit_current → fastapi_reload → observe_post_deploy_metrics → summarize_run succeeds
  • The shadow_start, notify_failure, and rollback_minimal tasks are skipped in this run because of the branch conditions
  • The full orchestration chain, including the dp and deploy TaskGroups, finishes normally

🧩 Meaning:

This proves that the platform is not just Triton/FastAPI running side by side, but an E2E ML Platform in which Airflow controls the whole flow from data preparation to post-deployment observation.


4-2) Promotion / Shadow Branch Structure Verification

📸 proof-09-03-branch-promotion-shadow.png

📸 proof-09-04-branch-promotion-shadow.png

✅ Checkpoints:

  • After check_result, the flow branches to promotion_start or shadow_start according to policy and threshold
  • In the success case, the flow continues through register_model_task → check_model_ready → deploy, proceeding to MLflow register and FastAPI reload
  • In the blocked case, the shadow_start and notify_failure flow is selected, and the Slack notification records Branch: shadow, reason: below_threshold, and the message Shadow path selected

🧩 Meaning:

This proves that the platform does not treat a finished training run as an automatic production rollout, but is an operations-grade ML pipeline that actually branches between the promotion and shadow paths based on thresholds and policy.
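
The branching decision after check_result can be sketched as a small Python function. This is a simplified illustration, not the DAG's actual code: it considers only a single metric and threshold (the source also mentions policy), and the function name, the dictionary shape, and the reason string "meets_threshold" are hypothetical. The task ids promotion_start and shadow_start and the reason below_threshold come from the checkpoints above.

```python
def choose_branch(metric: float, threshold: float) -> dict:
    """Pick the downstream task based on a single evaluation metric (simplified)."""
    if metric >= threshold:
        # "meets_threshold" is a hypothetical label for the promotion case.
        return {"task_id": "promotion_start", "branch": "promotion", "reason": "meets_threshold"}
    return {"task_id": "shadow_start", "branch": "shadow", "reason": "below_threshold"}
```

In Airflow, a branching callable typically returns the task id to follow, so the `task_id` field of this result is what such a callable would hand back, while `branch` and `reason` are the kind of values that could feed the Slack message.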


4-3) READY Sensor Deployment Gate Verification

📸 proof-09-05-model-ready-sensor.png

✅ Checkpoints:

  • In a promotion-path run, the check_model_ready task succeeds after register and before deploy
  • In a shadow-branch run, check_model_ready is skipped, so the deployment gate applies only on the promotion path
  • The deploy step does not proceed until the model reaches the READY state

🧩 Meaning:

This proves that a registered model is not pushed to production merely because it exists; the READY sensor is used as an explicit deployment gate, giving a safe rollout structure.
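
The gate behavior described above (poll until READY, otherwise do not deploy) can be sketched in plain Python. This is a generic polling loop under stated assumptions, not the actual check_model_ready implementation: the function name, its parameters, and the injected `is_ready` callable are all hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poke_interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready until it returns True or the timeout elapses.

    Returns True when the model became ready (deploy may proceed),
    False when the timeout was hit (deploy must not proceed).
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poke_interval_s)
```

The `is_ready` callable would wrap whatever readiness probe the platform uses, for example a request to the serving runtime's model readiness endpoint; keeping it injectable makes the gate easy to test without a live server.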


4-4) Manual Rollback DAG Execution Verification

📸 proof-09-06-manual-rollback-dag.png

📸 proof-09-07-manual-rollback-dag.png

✅ Checkpoints:

  • In the rollback_manual DAG, every step of validate_conf → load_ctx → guard_repo → write_ssot_files → rollback_triton → reload_fastapi → verify_convergence succeeds (validate_conf is the guard that validates input values and prevents path traversal)
  • After the rollback, the /models response shows the SSOT-based served version restored to the previous version, and the FastAPI reload result is also recorded in Slack
  • After clear pod-local cache is run, cache_pod_local is confirmed to be empty, which confirms convergence to the SSOT-based state rather than a pod-local override

🧩 Meaning:

This proves that, separately from the automatic rollback, a manual rollback guardrail that lets an operator explicitly perform recovery is implemented as a real DAG.
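
As an illustration of what an input guard such as validate_conf might do, here is a minimal Python sketch that validates the run configuration and rejects path traversal attempts. The configuration keys (`model_name`, `target_version`) and the allowed character set are assumptions; the real DAG's conf schema is not shown in the source.

```python
import re

# Only plain identifier-like names are accepted (assumed rule for this sketch).
_SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")
_VERSION = re.compile(r"[0-9]+")


def validate_conf(conf: dict) -> dict:
    """Validate a rollback request and return a normalized copy.

    Raises ValueError for missing values, path separators, or ".." segments.
    """
    model = str(conf.get("model_name", ""))
    version = str(conf.get("target_version", ""))
    if not _SAFE_NAME.fullmatch(model) or ".." in model:
        raise ValueError(f"unsafe model_name: {model!r}")
    if not _VERSION.fullmatch(version):
        raise ValueError(f"invalid target_version: {version!r}")
    return {"model_name": model, "target_version": int(version)}
```

Because slashes are outside the allowed character set and ".." is rejected explicitly, a value such as `../etc/passwd` can never be joined into a repository path by later steps.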


5) 🚀 Serving Runtime Verification

5-1) Triton READY State Verification (dev/prod)

📸 proof-10-triton-ready.png

✅ Checkpoints:

  • HTTP/1.1 200 OK
  • The repository index result shows best_model and best_model_shadow loaded in the dev environment, and best_model loaded in the prod environment

🧩 Meaning:

This proves that, in dev/prod, Triton is not merely running but is a runtime that has actually loaded models in the READY state and can serve them.
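
The repository index check can be expressed as a small Python helper that compares the loaded models against the expected set. This sketch assumes the index is a list of entries with `name` and `state` fields, with `READY` as the state of a loaded model; the function name and the sample data in the usage note are hypothetical.

```python
def missing_ready_models(index: list, expected: set) -> set:
    """Return the expected model names that are not reported as READY in the index."""
    ready = {entry.get("name") for entry in index if entry.get("state") == "READY"}
    return set(expected) - ready
```

For the dev environment, the expected set would be `{"best_model", "best_model_shadow"}`, and for prod `{"best_model"}`; an empty result means every expected model is loaded and ready.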


5-2) FastAPI health / models / metrics Verification (dev/prod)

📸 proof-11-01-fastapi-health.png

📸 proof-11-02-fastapi-models.png

📸 proof-11-03-fastapi-dev-metrics.png

📸 proof-11-03-fastapi-prod-metrics.png

✅ Checkpoints:

  • /health → status=ok
  • /models → exposes the version and run_id of aliases A/B
  • /metrics → exposes http_requests_total and process/python metrics

🧩 Meaning:

This proves that FastAPI is not a simple gateway: it also provides operational status lookup, exposes serving metadata, and acts as a metrics exporter.
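
To show how the /metrics output can be consumed outside Prometheus, here is a minimal Python sketch that sums every sample of a counter from the text exposition format. It assumes samples carry no trailing timestamp, and the label names in the usage note are hypothetical; only the metric name http_requests_total comes from the checkpoints above.

```python
def sum_counter(metrics_text: str, metric_name: str) -> float:
    """Sum all samples of metric_name in Prometheus text exposition output.

    Assumes each sample line is "<name>{labels} <value>" with no timestamp.
    """
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            total += float(value)
    return total
```

For example, given two samples such as `http_requests_total{method="GET"} 5.0` and `http_requests_total{method="POST"} 2.0`, the function adds them into a single request count, which is a quick way to confirm that traffic is actually being recorded.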


5-3) Reload API Success Verification

📸 proof-12-reload-variant-a-b.png

✅ Checkpoints:

  • status: success
  • variant: A, B
  • version: 79, 9
  • run_id: e57473991dc14de6a57dc7e5dfa1aa70

🧩 Meaning:

This proves that the FastAPI /variant/{alias}/reload API actually succeeded and that the reload path tied to the Triton serving state works at a level usable in operations.


6) 📊 Observability Verification

6-1) Metrics API Activation Verification

📸 proof-13-metrics-api.png

✅ Checkpoints:

  • The v1beta1.metrics.k8s.io APIService reports AVAILABLE=True

🧩 Meaning:

This proves that Kubernetes cluster-level metrics are enabled and that the metrics API foundation for querying node / pod resource usage is actually in place.


6-2) Prometheus / Alertmanager / ServiceMonitor / Rule Verification (dev/prod)

📸 proof-14-monitoring-objects.png

✅ Checkpoints:

  • Prometheus / Alertmanager resources exist
  • Multiple ServiceMonitors exist, including fastapi-(dev|prod)-sm
  • PrometheusRules exist, including fastapi-alerts and triton-alerts

🧩 Meaning:

This proves that dev/prod do not merely have monitoring installed: ServiceMonitor-based metrics collection and PrometheusRule-based alert policies targeting the FastAPI / Triton serving layer are deployed together.


6-3) Pod Resource Observation Verification

📸 proof-15-top-pods-head.png

✅ Checkpoints:

  • CPU / Memory usage can be checked for airflow-*, fastapi-*, loki-*, minio-*, feast-*, and others

🧩 Meaning:

The observability stack does not stop at configuration; actual workload resource usage can be inspected.


6-4) Grafana Dashboard Verification

Dev environment

📸 proof-15-01-grafana-dashboard-dev.png

Prod environment

📸 proof-15-02-grafana-dashboard-prod.png

✅ Checkpoints:

  • Grafana visualizes FastAPI request volume, Pod CPU/Memory, Triton inference, and serving resources
  • Metrics collected by Prometheus are wired into the Grafana dashboards
  • Service status, API traffic, the serving layer, and resource usage can be checked on a single screen

🧩 Meaning:

The platform does not stop at collecting metrics; it has a complete observability flow in which the results collected by Prometheus can be inspected visually in Grafana.


7) 📁 Operations Docs / Proof Structure Verification

7-1) Proof Index Verification

📸 proof-16-proof-index.png

✅ Checkpoints:

  • The core_only/, optional_on/, e2e_success/, and observability/ directories exist separately
  • Proof files related to GitOps boundaries exist, such as projects.txt, root-apps.txt, root-baseline.txt, and root-optional.txt
  • Storage-related proof files such as pv.txt and pvc_all.txt are organized alongside them

🧩 Meaning:

The project is not just code; it also carries a proof system that demonstrates platform state category by category.
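
A proof index like this one can be checked mechanically. The following Python sketch lists which of the directories and files named in the checkpoints above are missing under a given root; the function name and the choice to return a flat list are assumptions for illustration.

```python
from pathlib import Path

# Names taken from the checkpoints in this section.
EXPECTED_DIRS = ("core_only", "optional_on", "e2e_success", "observability")
EXPECTED_FILES = ("projects.txt", "root-apps.txt", "root-baseline.txt", "root-optional.txt")


def missing_proof_entries(root: Path) -> list:
    """Return the expected proof directories and files that are absent under root."""
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

Running `missing_proof_entries(Path("docs/proof/latest"))` from the repository root would return an empty list when the snapshot is complete.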


7-2) Operations Docs Directory Structure Verification

📸 proof-17-docs-structure.png

✅ Checkpoints:

  • docs/overview
  • docs/runbook
  • docs/security
  • docs/proof
  • docs/feature-store

🧩 Meaning:

The project is not just an implementation; it is a documented platform that manages the architecture description (overview), operating procedures (runbook), security policy (security), and execution evidence (proof) separately.


8) 🏁 Final Conclusion

8-1) Overall Verification Summary

Area              Result
GitOps boundary   ✅ AppProject / Root App / AppSet structure confirmed
Core-only         ✅ FastAPI / Triton verified after Optional removal
Optional ON       ✅ Feast / optional-envs re-attached successfully
Airflow DAG       ✅ E2E orchestration / branch / sensor / manual rollback verified
Serving Runtime   ✅ Triton READY / FastAPI metadata·reload·metrics verified
Observability     ✅ Prometheus / Alertmanager / ServiceMonitor / Rule deployment confirmed
Docs / Evidence   ✅ overview / runbook / security / proof structure exists

8-2) Conclusion

The proofs confirm that this platform divides deployment boundaries on a GitOps basis, handles Optional features independently, lets Airflow control the E2E rollout flow, and includes the Triton / FastAPI serving path and Observability, making it an operations-grade E2E ML Platform that can actually be verified.