Commit 06794812 067948120de164a61de09d3141d3c106299c68ff by cnb.bofCdSsphPA

Attach runnable command templates to the extraction plan

Constraint: The Phase-1 PostgreSQL plan needed to become immediately actionable without pretending the workers already exist
Rejected: Keep the plan as ordering-only metadata | It still leaves the next session to reconstruct command wiring by hand
Confidence: high
Scope-risk: narrow
Directive: Keep future worker implementations compatible with the env-var contract emitted by the planner report
Tested: /usr/local/miniconda3/bin/python scripts/plan_phase1_extraction_jobs_live.py --dsn 'postgres://d2:d2pass@127.0.0.1:5432/d2' --schema acr_test --job-status pending --output data/pgvector_eval/music20/phase1_extraction_plan_report.json; /usr/local/miniconda3/bin/python -m py_compile scripts/plan_phase1_extraction_jobs_live.py; git diff --check -- acr-engine/scripts/plan_phase1_extraction_jobs_live.py acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json docs/model-feature-registry-bootstrap.md docs/postgres_db_schema_samples.md docs/session-handoff.md docs/CHANGELOG.md
Not-tested: Real worker binaries at workers/run_chromaprint_job.py and workers/run_embedding_job.py do not exist yet
1 parent f13caa3e
......@@ -49,6 +49,10 @@
"run feature extraction for chromaprint v1",
"write to audio_fingerprint",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test OUTPUT_TARGET=audio_fingerprint \\\npython workers/run_chromaprint_job.py",
"EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -90,6 +94,10 @@
"run feature extraction for mert v1-95m",
"write to audio_embedding + audio_embedding_vector_768",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -131,6 +139,10 @@
"run feature extraction for mert v1-95m",
"write to audio_embedding + audio_embedding_vector_768",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=3 FEATURE_SET_ID=4 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=3 FEATURE_SET_ID=4 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -172,6 +184,10 @@
"run feature extraction for muq large-msd-iter",
"write to audio_embedding + audio_embedding_vector_768",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=4 FEATURE_SET_ID=5 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=muq MODEL_VERSION=large-msd-iter VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=4 FEATURE_SET_ID=5 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -213,6 +229,10 @@
"run feature extraction for ecapa acr-baseline-v1",
"write to audio_embedding + audio_embedding_vector_192",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=5 FEATURE_SET_ID=6 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=ecapa MODEL_VERSION=acr-baseline-v1 VECTOR_TABLE=audio_embedding_vector_192 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=5 FEATURE_SET_ID=6 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
}
],
......@@ -257,6 +277,10 @@
"run feature extraction for chromaprint v1",
"write to audio_fingerprint",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test OUTPUT_TARGET=audio_fingerprint \\\npython workers/run_chromaprint_job.py",
"EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
}
],
......@@ -300,6 +324,10 @@
"run feature extraction for mert v1-95m",
"write to audio_embedding + audio_embedding_vector_768",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -341,6 +369,10 @@
"run feature extraction for mert v1-95m",
"write to audio_embedding + audio_embedding_vector_768",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=3 FEATURE_SET_ID=4 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=3 FEATURE_SET_ID=4 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -382,6 +414,10 @@
"run feature extraction for muq large-msd-iter",
"write to audio_embedding + audio_embedding_vector_768",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=4 FEATURE_SET_ID=5 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=muq MODEL_VERSION=large-msd-iter VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=4 FEATURE_SET_ID=5 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
},
{
......@@ -423,6 +459,10 @@
"run feature extraction for ecapa acr-baseline-v1",
"write to audio_embedding + audio_embedding_vector_192",
"target scope: reference_set:phase1_hot_reference_v1"
],
"command_suggestions": [
"EXTRACTION_JOB_ID=5 FEATURE_SET_ID=6 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=ecapa MODEL_VERSION=acr-baseline-v1 VECTOR_TABLE=audio_embedding_vector_192 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py",
"EXTRACTION_JOB_ID=5 FEATURE_SET_ID=6 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test \\\npython workers/mark_job_status.py --status running"
]
}
]
......@@ -436,7 +476,8 @@
"feature_name": "fingerprint_asset",
"window_sec": 5.0,
"hop_sec": 2.5,
"physical_target": "audio_fingerprint"
"physical_target": "audio_fingerprint",
"primary_command": "EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test OUTPUT_TARGET=audio_fingerprint \\\npython workers/run_chromaprint_job.py"
},
{
"order": 2,
......@@ -446,7 +487,8 @@
"feature_name": "semantic_embedding",
"window_sec": 5.0,
"hop_sec": 2.5,
"physical_target": "audio_embedding"
"physical_target": "audio_embedding",
"primary_command": "EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py"
},
{
"order": 3,
......@@ -456,7 +498,8 @@
"feature_name": "semantic_embedding",
"window_sec": 10.0,
"hop_sec": 5.0,
"physical_target": "audio_embedding"
"physical_target": "audio_embedding",
"primary_command": "EXTRACTION_JOB_ID=3 FEATURE_SET_ID=4 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py"
},
{
"order": 4,
......@@ -466,7 +509,8 @@
"feature_name": "semantic_embedding",
"window_sec": 5.0,
"hop_sec": 2.5,
"physical_target": "audio_embedding"
"physical_target": "audio_embedding",
"primary_command": "EXTRACTION_JOB_ID=4 FEATURE_SET_ID=5 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=muq MODEL_VERSION=large-msd-iter VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py"
},
{
"order": 5,
......@@ -476,7 +520,8 @@
"feature_name": "semantic_embedding",
"window_sec": 5.0,
"hop_sec": 2.5,
"physical_target": "audio_embedding"
"physical_target": "audio_embedding",
"primary_command": "EXTRACTION_JOB_ID=5 FEATURE_SET_ID=6 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=ecapa MODEL_VERSION=acr-baseline-v1 VECTOR_TABLE=audio_embedding_vector_192 OUTPUT_TARGET=audio_embedding \\\npython workers/run_embedding_job.py"
}
]
}
\ No newline at end of file
......
......@@ -25,6 +25,26 @@ def parse_target_scope(target_scope: str) -> dict[str, Any]:
    return {'scope_type': 'unknown', 'scope_value': target_scope}


def build_command_suggestions(job: dict[str, Any], schema: str) -> list[str]:
    base_env = f"EXTRACTION_JOB_ID={job['extraction_job_id']} FEATURE_SET_ID={job['feature_set_id']} TARGET_SCOPE='{job['target_scope']}' PG_SCHEMA={schema}"
    commands = []
    if job['lane'] == 'exact':
        commands.append(
            base_env
            + " OUTPUT_TARGET=audio_fingerprint \\\npython workers/run_chromaprint_job.py"
        )
    else:
        commands.append(
            base_env
            + f" MODEL_NAME={job['model_name']} MODEL_VERSION={job['model_version']} VECTOR_TABLE={job['vector_table']} OUTPUT_TARGET={job['physical_target']} \\\npython workers/run_embedding_job.py"
        )
    commands.append(
        base_env
        + " \\\npython workers/mark_job_status.py --status running"
    )
    return commands


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument('--dsn', required=True)
......@@ -112,6 +132,7 @@ def main() -> None:
                f"target scope: {row[2]}",
            ],
        }
        item['command_suggestions'] = build_command_suggestions(item, args.schema)
        jobs.append(item)
        by_lane.setdefault(lane, []).append(item)
......@@ -139,6 +160,7 @@ def main() -> None:
                'window_sec': job['window_sec'],
                'hop_sec': job['hop_sec'],
                'physical_target': job['physical_target'],
                'primary_command': job['command_suggestions'][0],
            }
            for idx, job in enumerate(jobs)
        ],
......
## 2026-06-04
- Updated `plan_phase1_extraction_jobs_live.py` and `phase1_extraction_plan_report.json`, advancing the Phase-1 execution plan from an ordering-only plan to one that carries copy-ready command templates in `command_suggestions / primary_command`.
- Added `acr-engine/scripts/plan_phase1_extraction_jobs_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json`, which read pending jobs live from the PostgreSQL `feature_extraction_job` table and join the related tables to produce a Phase-1 execution plan ordered by lane / priority.
- Added `acr-engine/scripts/bootstrap_phase1_extraction_jobs_live.py` and `acr-engine/data/pgvector_eval/music20/phase1_extraction_jobs_report.json`, turning Phase-1 `feature_extraction_job` initialization into a live script that connects directly to PostgreSQL; 5 pending jobs have been created in the `acr_test` schema.
- Added `phase1_registry_bootstrap_idempotency_report.json` and accompanying documentation, verifying that table counts stay stable after `bootstrap_phase1_model_registry_live.py` is run twice in a row on the `acr_test` schema, which shows that the Phase-1 registry bootstrap is idempotent.
......
......@@ -397,3 +397,25 @@ cd /workspace/acr-engine
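Conclusion:
> PostgreSQL can now do more than describe "which jobs exist": it can directly produce a **feature-extraction plan sorted in execution order**.
### 10.3 ready-to-run command suggestions (done)
In this round the planner was upgraded further: **every job now gets a command suggestion**.
Example:
#### exact lane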
```bash
EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test OUTPUT_TARGET=audio_fingerprint \
python workers/run_chromaprint_job.py
```
#### semantic lane
```bash
EXTRACTION_JOB_ID=2 FEATURE_SET_ID=3 TARGET_SCOPE='reference_set:phase1_hot_reference_v1' PG_SCHEMA=acr_test MODEL_NAME=mert MODEL_VERSION=v1-95m VECTOR_TABLE=audio_embedding_vector_768 OUTPUT_TARGET=audio_embedding \
python workers/run_embedding_job.py
```
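This means the next session no longer has to hand-assemble environment variables and job bindings first; it can copy the command templates straight from the planner report.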
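Because `workers/run_chromaprint_job.py` and `workers/run_embedding_job.py` do not exist yet, the following is only a minimal sketch of how a future extraction worker might read the environment variables shown in the templates above. The `JobEnv` dataclass and the `load_job_env` helper are hypothetical names introduced for illustration; only the variable names come from the planner output.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variables every extraction command template above sets.
REQUIRED_VARS = (
    "EXTRACTION_JOB_ID",
    "FEATURE_SET_ID",
    "TARGET_SCOPE",
    "PG_SCHEMA",
    "OUTPUT_TARGET",
)


@dataclass(frozen=True)
class JobEnv:
    """Hypothetical container for the env-var contract emitted by the planner."""

    extraction_job_id: int
    feature_set_id: int
    target_scope: str
    pg_schema: str
    output_target: str
    # Only the semantic-lane template sets these three.
    model_name: Optional[str] = None
    model_version: Optional[str] = None
    vector_table: Optional[str] = None


def load_job_env(env: Optional[Mapping[str, str]] = None) -> JobEnv:
    """Parse the planner's variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing required variables: " + ", ".join(missing))
    return JobEnv(
        extraction_job_id=int(env["EXTRACTION_JOB_ID"]),
        feature_set_id=int(env["FEATURE_SET_ID"]),
        target_scope=env["TARGET_SCOPE"],
        pg_schema=env["PG_SCHEMA"],
        output_target=env["OUTPUT_TARGET"],
        model_name=env.get("MODEL_NAME"),
        model_version=env.get("MODEL_VERSION"),
        vector_table=env.get("VECTOR_TABLE"),
    )


# Example: the semantic-lane template above, expressed as a plain mapping.
example_env = {
    "EXTRACTION_JOB_ID": "2",
    "FEATURE_SET_ID": "3",
    "TARGET_SCOPE": "reference_set:phase1_hot_reference_v1",
    "PG_SCHEMA": "acr_test",
    "MODEL_NAME": "mert",
    "MODEL_VERSION": "v1-95m",
    "VECTOR_TABLE": "audio_embedding_vector_768",
    "OUTPUT_TARGET": "audio_embedding",
}
print(load_job_env(example_env))
```

Accepting a mapping instead of reading `os.environ` directly keeps the parser testable without touching the process environment. The `mark_job_status.py` template does not set `OUTPUT_TARGET`, so a status helper would need a looser set of required variables than this sketch assumes.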
......
......@@ -430,6 +430,12 @@ flowchart LR
Corresponding live report:
- `acr-engine/data/pgvector_eval/music20/phase1_extraction_plan_report.json`
After this round's additions, the plan also includes:
- `command_suggestions`
- `primary_command`
In other words, the PostgreSQL pending jobs now lead directly to copy-ready command templates.
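As a rough illustration of consuming those fields, the sketch below collects every `primary_command` string from a parsed report. It walks the JSON tree recursively, so it does not depend on the report's top-level key names. The `iter_primary_commands` helper and the `plan` key in the sample are hypothetical; only the `order` and `primary_command` field names come from the report.

```python
from typing import Any, Iterator


def iter_primary_commands(node: Any) -> Iterator[str]:
    """Yield every 'primary_command' string found anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        command = node.get("primary_command")
        if isinstance(command, str):
            yield command
        for value in node.values():
            yield from iter_primary_commands(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_primary_commands(value)


# Hypothetical fragment shaped like the report entries; "plan" is an assumed key.
sample_report = {
    "plan": [
        {
            "order": 1,
            "primary_command": (
                "EXTRACTION_JOB_ID=1 FEATURE_SET_ID=2 "
                "TARGET_SCOPE='reference_set:phase1_hot_reference_v1' "
                "PG_SCHEMA=acr_test OUTPUT_TARGET=audio_fingerprint \\\n"
                "python workers/run_chromaprint_job.py"
            ),
        },
        {"order": 2, "primary_command": "EXTRACTION_JOB_ID=2 python workers/run_embedding_job.py"},
    ]
}

for command in iter_primary_commands(sample_report):
    print(command)
    print("---")
```

To use it on the real report, load `phase1_extraction_plan_report.json` with `json.load` and pass the result to the same function. The commands are templates only until the worker scripts exist.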
### Route 1: continue PostgreSQL engineering work
1. Generalize `live_pgvector_music20_eval.py` into:
......
......@@ -185,6 +185,7 @@ sed -n '1,320p' acr-engine/sql/acr_pg_schema_v2.sql
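- The Phase-1 registry bootstrap now has idempotency evidence: after two consecutive runs on the same schema, `model_registry=5 / feature_set_registry=6 / reference_set_registry=2` remain unchanged
- 5 `feature_extraction_job` rows have been created in the PostgreSQL `acr_test` schema, so later MERT / MuQ integration can start directly from the pending jobs
- A Phase-1 extraction execution plan has been generated on the PostgreSQL `acr_test` schema; the current order is `chromaprint -> mert -> mert-long -> muq -> ecapa`
- The extraction plan report now includes `command_suggestions / primary_command`, so worker command templates can be copied straight from the plan next time
### Not verified / remaining gaps
- **MERT / MuQ encoder-only feature extraction has not actually been run**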
......