Commit b766c74e b766c74e9ff1c9be3223d226d9ef4e0da9a7cb03 by cnb.bofCdSsphPA

Make open-dataset manifests trainable end to end

Constraint: Open dataset onboarding was incomplete until generated manifests could enter train.py without manual path fixes
Rejected: Keep manifests as ingestion-only artifacts | Fails the actual training handoff and leaves the workflow broken
Confidence: high
Scope-risk: moderate
Directive: Preserve the self-contained output layout (audio plus manifests) for all future external dataset imports
Tested: /usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/synthetic_v2/songs --output-root data/external_ingested/synthetic_as_open_fixed --eval-ratio 0.2 --query-duration 5.0; /usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/synthetic_as_open_fixed/fma/manifests; /usr/local/miniconda3/bin/python train.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --output data/models_open_smoke_fixed --device cpu --epochs 1 --batch-size 2 --dry-run; /usr/local/miniconda3/bin/python -m py_compile src/data/dataset.py train.py src/data/manifest_tools.py src/data/external_adapters.py
Not-tested: Full multi-epoch training and index/eval loop on a real downloaded FMA or MTG-Jamendo corpus
1 parent fa231444
Showing 32 changed files with 772 additions and 9 deletions
[
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
[
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.394,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.922,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 5.0,
"type": "clean",
"offset": 4.219,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.265,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.094,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 5.0,
"type": "clean",
"offset": 3.403,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.927,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 5.0,
"type": "clean",
"offset": 7.046,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
[
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.75,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 5.0,
"type": "clean",
"offset": 7.365,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.186,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 5.0,
"type": "clean",
"offset": 6.499,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.204,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.058,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 5.0,
"type": "clean",
"offset": 9.572,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.475,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.071,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 5.0,
"type": "clean",
"offset": 5.362,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 5.0,
"type": "clean",
"offset": 3.785,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.294,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 5.0,
"type": "clean",
"offset": 8.617,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 5.0,
"type": "clean",
"offset": 2.279,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 5.0,
"type": "clean",
"offset": 0.798,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 5.0,
"type": "clean",
"offset": 1.01,
"segment_type": "external_query",
"source_dataset": "fma"
},
{
"song_id": "fma_00000",
"audio_path": "audio/fma_00000.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00001",
"audio_path": "audio/fma_00001.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00002",
"audio_path": "audio/fma_00002.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00003",
"audio_path": "audio/fma_00003.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00004",
"audio_path": "audio/fma_00004.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00005",
"audio_path": "audio/fma_00005.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00006",
"audio_path": "audio/fma_00006.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00007",
"audio_path": "audio/fma_00007.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00008",
"audio_path": "audio/fma_00008.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00009",
"audio_path": "audio/fma_00009.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00010",
"audio_path": "audio/fma_00010.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00011",
"audio_path": "audio/fma_00011.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00012",
"audio_path": "audio/fma_00012.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00013",
"audio_path": "audio/fma_00013.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00014",
"audio_path": "audio/fma_00014.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00015",
"audio_path": "audio/fma_00015.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00016",
"audio_path": "audio/fma_00016.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00017",
"audio_path": "audio/fma_00017.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00018",
"audio_path": "audio/fma_00018.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00019",
"audio_path": "audio/fma_00019.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00020",
"audio_path": "audio/fma_00020.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00021",
"audio_path": "audio/fma_00021.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00022",
"audio_path": "audio/fma_00022.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
},
{
"song_id": "fma_00023",
"audio_path": "audio/fma_00023.wav",
"duration": 15.0,
"type": "reference",
"source_dataset": "fma"
}
]
\ No newline at end of file
......@@ -32,6 +32,7 @@ class ACRDataset(Dataset):
self.augment = augment
self.n_crops = n_crops_per_song
self.data_dir = Path(data_dir)
self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir
meta_path = self.data_dir / f"{split}.json"
with open(meta_path) as f:
......@@ -41,7 +42,7 @@ class ACRDataset(Dataset):
for item in self.metadata:
if references_only and item.get("type") != "reference":
continue
song_path = self.data_dir / item["audio_path"]
song_path = self.asset_root / item["audio_path"]
if song_path.exists():
self.samples.append(item)
......@@ -75,7 +76,7 @@ class ACRDataset(Dataset):
max_offset = max(0, duration - 5.0)
offset = random.uniform(0, max_offset) if max_offset > 0 else 0
audio_path = self.data_dir / sample["audio_path"]
audio_path = self.asset_root / sample["audio_path"]
y = self._load_segment(str(audio_path), offset, 5.0)
if self.augment and sample.get("type") != "reference":
......@@ -113,6 +114,7 @@ class ACRTestDataset(Dataset):
self.n_fft = n_fft
self.hop_length = hop_length
self.data_dir = Path(data_dir)
self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir
meta_path = self.data_dir / f"{split}.json"
with open(meta_path) as f:
......@@ -120,7 +122,7 @@ class ACRTestDataset(Dataset):
self.samples = []
for item in self.metadata:
p = self.data_dir / item["audio_path"]
p = self.asset_root / item["audio_path"]
if p.exists():
self.samples.append(item)
......@@ -132,7 +134,7 @@ class ACRTestDataset(Dataset):
def __getitem__(self, idx):
sample = self.samples[idx]
audio_path = self.data_dir / sample["audio_path"]
audio_path = self.asset_root / sample["audio_path"]
y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True, offset=0, duration=min(sample["duration"], 5.0))
seg_len = 5 * self.sr
if len(y) < seg_len:
......@@ -178,6 +180,7 @@ class SongPairDataset(Dataset):
self.segment_len = int(segment_dur * sr)
self.augment = augment
self.data_dir = Path(data_dir)
self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir
with open(self.data_dir / f"{split}.json") as f:
metadata = json.load(f)
......@@ -186,7 +189,7 @@ class SongPairDataset(Dataset):
for item in metadata:
if item.get("type") == "reference":
continue
p = self.data_dir / item["audio_path"]
p = self.asset_root / item["audio_path"]
if p.exists():
self.by_song.setdefault(item["song_id"], []).append(item)
......@@ -207,7 +210,7 @@ class SongPairDataset(Dataset):
return len(self.sample_song_ids)
def _load_clip(self, sample: Dict) -> np.ndarray:
path = self.data_dir / sample["audio_path"]
path = self.asset_root / sample["audio_path"]
y, _ = librosa.load(str(path), sr=self.sr, mono=True, duration=5.0)
if len(y) < self.segment_len:
y = np.pad(y, (0, self.segment_len - len(y)))
......
......@@ -6,6 +6,7 @@ import argparse
import csv
import json
import random
import shutil
from pathlib import Path
from typing import List, Dict
import soundfile as sf
......@@ -49,13 +50,19 @@ def build_train_eval_from_audio_dir(
output_dir.mkdir(parents=True, exist_ok=True)
manifests_dir = output_dir / "manifests"
manifests_dir.mkdir(parents=True, exist_ok=True)
audio_out_dir = output_dir / "audio"
audio_out_dir.mkdir(parents=True, exist_ok=True)
refs = []
train = []
test = []
for idx, path in enumerate(files):
rel = path.relative_to(output_dir.parent if output_dir.parent in path.parents else audio_dir.parent)
target_name = f"{source_dataset}_{idx:05d}{path.suffix.lower()}"
target_path = audio_out_dir / target_name
if not target_path.exists():
shutil.copy2(path, target_path)
rel = target_path.relative_to(output_dir)
song_id = f"{source_dataset}_{idx:05d}"
try:
info = sf.info(str(path))
......
......@@ -50,6 +50,28 @@
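- The open-dataset onboarding path is now condensed into a single-page executable workflow
- Onboarding a real local FMA / MTG-Jamendo directory later will take less effort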
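### Stage: Open-dataset manifests feed directly into training
Completed:
- Fixed path self-consistency of the open-dataset manifests generated by `src/data/manifest_tools.py`
- Open-dataset audio is now copied into `audio/` under the output root
- Fixed path resolution in `src/data/dataset.py` for the `.../manifests` directory layout
- Connected `prepare-local -> validate-local -> train.py --dry-run` end to end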
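
The self-contained layout (audio plus manifests under one output root) can be sketched roughly as follows. This is a minimal illustration modeled on the `manifest_tools.py` hunk in this commit, not the actual function: the name `copy_into_layout` and the reduced set of manifest fields are assumptions made for the example.

```python
import shutil
from pathlib import Path


def copy_into_layout(files, output_dir, source_dataset):
    """Copy source audio into <output_dir>/audio and return manifest-style entries.

    Hypothetical helper: mirrors the naming scheme seen in the diff
    (f"{source_dataset}_{idx:05d}" plus the lower-cased suffix).
    """
    output_dir = Path(output_dir)
    audio_out_dir = output_dir / "audio"
    audio_out_dir.mkdir(parents=True, exist_ok=True)

    entries = []
    for idx, path in enumerate(files):
        path = Path(path)
        song_id = f"{source_dataset}_{idx:05d}"
        target_path = audio_out_dir / f"{song_id}{path.suffix.lower()}"
        # Skip the copy when the target already exists, as the diff does.
        if not target_path.exists():
            shutil.copy2(path, target_path)
        entries.append({
            "song_id": song_id,
            # Stored relative to the output root, e.g. "audio/fma_00000.wav".
            "audio_path": target_path.relative_to(output_dir).as_posix(),
            "source_dataset": source_dataset,
        })
    return entries
```

Because every `audio_path` is relative to the output root, the whole output directory can be moved or archived without rewriting the manifests.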
Verification results:
- `/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/synthetic_v2/songs --output-root data/external_ingested/synthetic_as_open_fixed --eval-ratio 0.2 --query-duration 5.0` succeeded
- `/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/synthetic_as_open_fixed/fma/manifests` succeeded
- `/usr/local/miniconda3/bin/python train.py --data data/external_ingested/synthetic_as_open_fixed/fma/manifests --output data/models_open_smoke_fixed --device cpu --epochs 1 --batch-size 2 --dry-run` succeeded
- Current results:
- `catalog=24`
- `train_queries=16`
- `test_queries=8`
- `Dry run passed!`
Conclusion:
- The open-dataset path can now not only generate manifests but also actually enter training
- Real FMA / MTG-Jamendo data can later go through the same pipeline directly
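
The path resolution added to `src/data/dataset.py` in this commit can be isolated as a small sketch. The expression is taken from the diff; the wrapper functions `resolve_asset_root` and `resolve_audio_path` are hypothetical names introduced only for illustration.

```python
from pathlib import Path


def resolve_asset_root(data_dir):
    """Return the directory that manifest `audio_path` values are relative to.

    If `--data` points at a `.../manifests` directory, audio lives next to it
    (one level up); otherwise the data directory itself is the root.
    """
    data_dir = Path(data_dir)
    return data_dir.parent if data_dir.name == "manifests" else data_dir


def resolve_audio_path(data_dir, audio_path):
    """Join a manifest-relative audio path onto the resolved asset root."""
    return resolve_asset_root(data_dir) / audio_path
```

With this rule, a layout where manifests and audio sit in the same directory keeps working unchanged, while the new `manifests/` plus `audio/` layout resolves one level up.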
### Stage: confused targeted optimization v6 (sample-level weighting)
Completed:
......
......@@ -20,8 +20,8 @@ flowchart LR
A[Local Open Audio Dir] --> B[inspect-local / inspect-batch]
B --> C[prepare-local]
C --> D[validate-local]
D --> E[train.json]
D --> F[test.json]
D --> E[train.py]
D --> F[evaluate.py]
```
---
......@@ -34,6 +34,7 @@ flowchart LR
| Batch comparison | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `inspect-batch ...` | Compare multiple candidate directories |
| Generate manifests | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `prepare-local ...` | Produces train/test/catalog |
| Pre-training validation | [`src/data/external_adapters.py`](../acr-engine/src/data/external_adapters.py) `validate-local ...` | Confirm the structure is correct |
| Training smoke test | [`train.py`](../acr-engine/train.py) `--data ... --dry-run` | Verify manifests can enter training directly |
---
......@@ -45,6 +46,7 @@ flowchart LR
/usr/local/miniconda3/bin/python src/data/external_adapters.py inspect-local fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio --output-root data/external_ingested --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py validate-local fma data/external_ingested/fma/manifests
/usr/local/miniconda3/bin/python train.py --data data/external_ingested/fma/manifests --output data/models_fma_smoke --device cpu --epochs 1 --batch-size 2 --dry-run
```
### 3.2 Multi-directory comparison
......@@ -78,6 +80,8 @@ flowchart LR
- `test_queries=8`
- `validate-local`
- `ok=true`
- `train.py --dry-run`
- `Dry run passed! Pipeline is working.`
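
The dataset classes in the diff only keep manifest entries whose audio file exists, so a broken path drops samples rather than raising an error. A quick independent check before training might look like the sketch below. It is an assumption-laden illustration: `find_missing_audio` is a hypothetical helper, not something `validate-local` is known to run.

```python
import json
from pathlib import Path


def find_missing_audio(manifests_dir, split):
    """List `audio_path` values in <split>.json that do not resolve to a file.

    Uses the same asset-root rule as the dataset classes in this commit.
    """
    manifests_dir = Path(manifests_dir)
    asset_root = (
        manifests_dir.parent if manifests_dir.name == "manifests" else manifests_dir
    )
    with open(manifests_dir / f"{split}.json") as f:
        entries = json.load(f)
    return [
        entry["audio_path"]
        for entry in entries
        if not (asset_root / entry["audio_path"]).exists()
    ]
```

An empty list for each split suggests that every manifest entry will survive the `exists()` filter when the dataset is constructed.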
---
......