Commit 90e252b8 90e252b89770b0f4126cfff3e972cf3ad4e66d1f by cnb.bofCdSsphPA

Reduce silent-query noise in training and open-dataset preparation

Constraint: Real music queries often include long silence heads/tails, but the pipeline still needs random-crop generalization and simple CLI controls
Rejected: Replace all random crops with structure-aware segmentation | would overfit to curated boundaries and diverge from messy real-world query distributions
Confidence: high
Scope-risk: moderate
Directive: Keep random as fallback; layer beat/onset/chorus-aware segmentation on top instead of removing silence-aware and sliding paths
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/src/data/dataset.py acr-engine/src/data/manifest_tools.py acr-engine/train.py acr-engine/src/data/external_adapters.py; external_adapters.py prepare-local fma /tmp/segtest_audio --query-strategy silence_aware; train.py --data data/synthetic_v2 --dry-run --segment-strategy hybrid
Not-tested: Full FMA smoke retraining/eval with the new segmentation strategies
1 parent d61ee980
......@@ -23,6 +23,8 @@ class ACRDataset(Dataset):
n_crops_per_song: int = 4,
song_to_idx: Optional[Dict[str, int]] = None,
references_only: bool = False,
segment_strategy: str = "random",
silence_top_db: int = 30,
):
self.sr = sr
self.n_mels = n_mels
......@@ -31,6 +33,8 @@ class ACRDataset(Dataset):
self.segment_len = int(segment_dur * sr)
self.augment = augment
self.n_crops = n_crops_per_song
self.segment_strategy = segment_strategy
self.silence_top_db = silence_top_db
self.data_dir = Path(data_dir)
self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir
......@@ -70,13 +74,52 @@ class ACRDataset(Dataset):
        )
        return librosa.power_to_db(mel, ref=np.max)

    def _find_non_silent_intervals(self, y: np.ndarray) -> List[tuple[int, int]]:
        intervals = librosa.effects.split(y, top_db=self.silence_top_db)
        if intervals is None or len(intervals) == 0:
            return [(0, len(y))]
        return [(int(s), int(e)) for s, e in intervals]

    def _choose_offset(self, sample: Dict, audio_path: Path) -> float:
        duration = float(sample["duration"])
        max_offset = max(0.0, duration - 5.0)
        if max_offset <= 0:
            return 0.0
        if self.segment_strategy == "random":
            return random.uniform(0, max_offset)

        y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
        target_len = self.segment_len
        intervals = self._find_non_silent_intervals(y)
        valid_intervals = []
        for start, end in intervals:
            if end - start >= target_len:
                valid_intervals.append((start, end))

        if self.segment_strategy == "silence_aware":
            if valid_intervals:
                start, end = random.choice(valid_intervals)
                seg_max_start = max(start, end - target_len)
                chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
                return min(chosen / self.sr, max_offset)
            return random.uniform(0, max_offset)

        if self.segment_strategy == "hybrid":
            if valid_intervals and random.random() < 0.7:
                start, end = random.choice(valid_intervals)
                seg_max_start = max(start, end - target_len)
                chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
                return min(chosen / self.sr, max_offset)
            return random.uniform(0, max_offset)

        return random.uniform(0, max_offset)

    def __getitem__(self, idx):
        sample = self.samples[idx // self.n_crops]
        duration = sample["duration"]
        max_offset = max(0, duration - 5.0)
        offset = random.uniform(0, max_offset) if max_offset > 0 else 0
        audio_path = self.asset_root / sample["audio_path"]
        offset = self._choose_offset(sample, audio_path)
        y = self._load_segment(str(audio_path), offset, 5.0)
        if self.augment and sample.get("type") != "reference":
......@@ -172,6 +215,8 @@ class SongPairDataset(Dataset):
hop_length: int = 160,
segment_dur: float = 5.0,
augment: bool = True,
segment_strategy: str = "random",
silence_top_db: int = 30,
):
self.sr = sr
self.n_mels = n_mels
......@@ -179,6 +224,8 @@ class SongPairDataset(Dataset):
self.hop_length = hop_length
self.segment_len = int(segment_dur * sr)
self.augment = augment
self.segment_strategy = segment_strategy
self.silence_top_db = silence_top_db
self.data_dir = Path(data_dir)
self.asset_root = self.data_dir.parent if self.data_dir.name == "manifests" else self.data_dir
......@@ -211,11 +258,32 @@ class SongPairDataset(Dataset):
    def _load_clip(self, sample: Dict) -> np.ndarray:
        path = self.asset_root / sample["audio_path"]
        y, _ = librosa.load(str(path), sr=self.sr, mono=True, duration=5.0)
        full_y, _ = librosa.load(str(path), sr=self.sr, mono=True)
        duration = float(sample.get("duration", len(full_y) / self.sr))
        max_offset = max(0.0, duration - 5.0)
        offset = 0.0
        if max_offset > 0:
            if self.segment_strategy == "random":
                offset = random.uniform(0, max_offset)
            else:
                intervals = librosa.effects.split(full_y, top_db=self.silence_top_db)
                valid = [(int(s), int(e)) for s, e in intervals if int(e) - int(s) >= self.segment_len] if len(intervals) else []
                if self.segment_strategy == "silence_aware" and valid:
                    start, end = random.choice(valid)
                    seg_max_start = max(start, end - self.segment_len)
                    chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
                    offset = min(chosen / self.sr, max_offset)
                elif self.segment_strategy == "hybrid" and valid and random.random() < 0.7:
                    start, end = random.choice(valid)
                    seg_max_start = max(start, end - self.segment_len)
                    chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
                    offset = min(chosen / self.sr, max_offset)
                else:
                    offset = random.uniform(0, max_offset)
        start = int(offset * self.sr)
        y = full_y[start : start + self.segment_len]
        if len(y) < self.segment_len:
            y = np.pad(y, (0, self.segment_len - len(y)))
        else:
            y = y[: self.segment_len]
        return y

    def _to_mel(self, y: np.ndarray) -> torch.Tensor:
......
......@@ -104,6 +104,8 @@ class BaseAdapter:
eval_ratio: float = 0.2,
query_duration: float = 8.0,
query_stride: float | None = None,
query_strategy: str = "random",
silence_top_db: int = 30,
seed: int = 42,
) -> Dict:
output_root.mkdir(parents=True, exist_ok=True)
......@@ -126,6 +128,12 @@ class BaseAdapter:
str(query_stride),
])
cmd.extend([
"--query-strategy",
str(query_strategy),
"--silence-top-db",
str(silence_top_db),
])
cmd.extend([
"--seed",
str(seed),
])
......@@ -361,6 +369,9 @@ def smoke_local_dataset(
eval_ratio: float,
query_duration: float,
query_stride: float | None,
query_strategy: str,
segment_strategy: str,
silence_top_db: int,
seed: int,
train_epochs: int,
batch_size: int,
......@@ -388,6 +399,8 @@ def smoke_local_dataset(
eval_ratio=eval_ratio,
query_duration=query_duration,
query_stride=query_stride,
query_strategy=query_strategy,
silence_top_db=silence_top_db,
seed=seed,
)
manifests_dir = Path(prepare_summary["output_dir"])
......@@ -407,6 +420,8 @@ def smoke_local_dataset(
"--device", resolved_device,
"--epochs", str(train_epochs),
"--batch-size", str(batch_size),
"--segment-strategy", str(segment_strategy),
"--silence-top-db", str(silence_top_db),
], check=True)
subprocess.run([
......@@ -444,6 +459,9 @@ def smoke_local_dataset(
base_cfg=base_cfg,
)
config["data"]["manifest_query_stride"] = query_stride
config["data"]["manifest_query_strategy"] = query_strategy
config["data"]["silence_top_db"] = silence_top_db
config["run"]["train_segment_strategy"] = segment_strategy
report_dir.mkdir(parents=True, exist_ok=True)
config_path.write_text(json.dumps(config, indent=2))
......@@ -493,6 +511,8 @@ def main():
p.add_argument("--eval-ratio", type=float, default=0.2)
p.add_argument("--query-duration", type=float, default=8.0)
p.add_argument("--query-stride", type=float, default=None)
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--seed", type=int, default=42)
p = sub.add_parser("inspect-local")
......@@ -523,6 +543,9 @@ def main():
p.add_argument("--eval-ratio", type=float, default=0.2)
p.add_argument("--query-duration", type=float, default=8.0)
p.add_argument("--query-stride", type=float, default=None)
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--segment-strategy", choices=["random", "silence_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--seed", type=int, default=42)
p.add_argument("--train-epochs", type=int, default=1)
p.add_argument("--batch-size", type=int, default=2)
......@@ -545,6 +568,8 @@ def main():
eval_ratio=args.eval_ratio,
query_duration=args.query_duration,
query_stride=args.query_stride,
query_strategy=args.query_strategy,
silence_top_db=args.silence_top_db,
seed=args.seed,
)
print(json.dumps(summary, indent=2, ensure_ascii=False))
......@@ -577,6 +602,9 @@ def main():
eval_ratio=args.eval_ratio,
query_duration=args.query_duration,
query_stride=args.query_stride,
query_strategy=args.query_strategy,
segment_strategy=args.segment_strategy,
silence_top_db=args.silence_top_db,
seed=args.seed,
train_epochs=args.train_epochs,
batch_size=args.batch_size,
......
......@@ -11,6 +11,7 @@ from pathlib import Path
from typing import List, Dict
import numpy as np
import soundfile as sf
import librosa
def write_catalog(records: List[Dict], output_path: Path):
......@@ -45,6 +46,8 @@ def build_train_eval_from_audio_dir(
eval_ratio: float = 0.2,
query_duration: float = 8.0,
query_stride: float | None = None,
query_strategy: str = "random",
silence_top_db: int = 30,
seed: int = 42,
):
rng = random.Random(seed)
......@@ -59,6 +62,38 @@ def build_train_eval_from_audio_dir(
    train = []
    test = []

    def compute_silence_aware_offsets(path: Path, duration: float) -> List[float]:
        if duration < query_duration:
            return []
        try:
            y, sr = librosa.load(str(path), sr=None, mono=True)
            intervals = librosa.effects.split(y, top_db=silence_top_db)
            if intervals is None or len(intervals) == 0:
                raise ValueError("no_non_silent_intervals")
            offsets = []
            target_len = int(query_duration * sr)
            for start, end in intervals:
                start = int(start)
                end = int(end)
                if end - start < target_len:
                    continue
                if query_stride and query_stride > 0:
                    stride = int(query_stride * sr)
                    local_positions = list(range(start, max(start + 1, end - target_len + 1), stride))
                    if not local_positions:
                        local_positions = [start]
                    last_pos = end - target_len
                    if last_pos >= start and local_positions[-1] != last_pos:
                        local_positions.append(last_pos)
                    offsets.extend([round(pos / sr, 3) for pos in local_positions])
                else:
                    seg_max_start = max(start, end - target_len)
                    chosen = rng.randint(start, seg_max_start) if seg_max_start > start else start
                    offsets.append(round(chosen / sr, 3))
            return sorted(set(x for x in offsets if x <= max(0.0, duration - query_duration)))
        except Exception:
            return []

    for idx, path in enumerate(files):
        target_name = f"{source_dataset}_{idx:05d}{path.suffix.lower()}"
        target_path = audio_out_dir / target_name
......@@ -82,7 +117,21 @@ def build_train_eval_from_audio_dir(
        refs.append(ref)

        if duration >= query_duration:
            if query_strategy in {"silence_aware", "hybrid"}:
                silence_offsets = compute_silence_aware_offsets(path, duration)
            else:
                silence_offsets = []
            if query_strategy == "silence_aware" and silence_offsets:
                offsets = silence_offsets
            elif query_strategy == "hybrid" and silence_offsets:
                if query_stride and query_stride > 0:
                    offsets = silence_offsets
                else:
                    max_offset = max(0.0, duration - query_duration)
                    random_offset = round(rng.uniform(0.0, max_offset) if max_offset > 0 else 0.0, 3)
                    offsets = sorted(set(silence_offsets + [random_offset]))
            elif query_stride and query_stride > 0:
                max_offset = max(0.0, duration - query_duration)
                offsets = [round(x, 3) for x in np.arange(0.0, max_offset + 1e-9, query_stride).tolist()]
                if not offsets:
......@@ -124,6 +173,7 @@ def build_train_eval_from_audio_dir(
"test_queries": len(test),
"query_duration": query_duration,
"query_stride": query_stride,
"query_strategy": query_strategy,
"output_dir": str(manifests_dir),
}
......@@ -225,6 +275,8 @@ def main():
p.add_argument("--eval-ratio", type=float, default=0.2)
p.add_argument("--query-duration", type=float, default=8.0)
p.add_argument("--query-stride", type=float, default=None)
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--seed", type=int, default=42)
p = sub.add_parser("inspect-audio-dir")
......@@ -247,6 +299,8 @@ def main():
eval_ratio=args.eval_ratio,
query_duration=args.query_duration,
query_stride=args.query_stride,
query_strategy=args.query_strategy,
silence_top_db=args.silence_top_db,
seed=args.seed,
)
print(json.dumps({"status": "ok", **summary}, ensure_ascii=False))
......
......@@ -125,6 +125,8 @@ def main():
parser.add_argument("--epochs", type=int, default=None)
parser.add_argument("--batch-size", type=int, default=None)
parser.add_argument("--lr", type=float, default=None)
parser.add_argument("--segment-strategy", choices=["random", "silence_aware", "hybrid"], default="random")
parser.add_argument("--silence-top-db", type=int, default=30)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
......@@ -153,6 +155,8 @@ def main():
hop_length=cfg["data"]["hop_length"],
segment_dur=cfg["data"]["segment_dur"],
augment=True,
segment_strategy=args.segment_strategy,
silence_top_db=args.silence_top_db,
)
catalog_dataset = ACRDataset(
......@@ -166,6 +170,8 @@ def main():
augment=False,
n_crops_per_song=1,
song_to_idx=train_dataset.song_to_idx,
segment_strategy=args.segment_strategy,
silence_top_db=args.silence_top_db,
)
train_loader = DataLoader(
......
......@@ -5398,3 +5398,46 @@
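- **Short-video clips with manually labeled offsets**: keep a single query
- **Material that has only the full track and no query start point**: automatically generate multi-window queries
- This lets query-type material such as `7/8/16/18` enter the training and evaluation pipeline more directly, while keeping `pgvector` ingestion traceable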

### Stage: silence-aware segmentation for training and open-dataset query generation

Completed:

- `acr-engine/src/data/dataset.py` adds the following for training crops:
  - `segment_strategy=random|silence_aware|hybrid`
  - `silence_top_db`
  - Integration of `librosa.effects.split`, so that non-silent regions are preferred as the source of training segments
- `acr-engine/src/data/manifest_tools.py` adds the following for external-data query generation:
  - `--query-strategy random|sliding|silence_aware|hybrid`
  - `--silence-top-db`
- `acr-engine/train.py` exposes the training CLI arguments:
  - `--segment-strategy`
  - `--silence-top-db`
- `acr-engine/src/data/external_adapters.py` passes the strategies through `prepare-local` / `smoke-local` and writes them to the saved config
- [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md) gains a "Segmentation strategies" section

Verification results:

- Compile check:
  - `/usr/local/miniconda3/bin/python -m py_compile src/data/dataset.py src/data/manifest_tools.py train.py src/data/external_adapters.py`
- Synthetic-audio check:
  - Constructed `4s silence + 10s tone + 4s silence`
  - `manifest_tools.py --query-strategy silence_aware --query-duration 5 --query-stride 2.5`
  - Exported query offsets: `3.968, 8.968, 9.08`
  - This shows that queries now clearly favor the non-silent body of the track
- Training-side offset check:
  - `random` offset samples: `0.325, 1.13, 2.902, 3.575, 8.313, 8.797, 9.574, 11.598`
  - `silence_aware` offset samples: `4.173, 4.228, 4.736, 5.111, 5.874, 5.974, 8.436, 8.805`
  - This shows that silence-aware selection markedly reduces the chance of landing in the head or tail silence
- Dry-run check:
  - `train.py --data data/synthetic_v2 --dry-run --segment-strategy silence_aware`
  - Forward/backward succeeded, `Embedding shape: torch.Size([64, 192])`
- Adapter check:
  - `external_adapters.py prepare-local ... --query-strategy silence_aware`
  - The summary records `query_strategy: silence_aware`
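
Conclusion:

- The project no longer relies on random cropping alone
- The following are now in place:
  - **Training side**: `random / silence_aware / hybrid`
  - **Catalog-building side**: fixed sliding window
  - **Open-dataset query generation side**: `random / sliding / silence_aware / hybrid`
- The next stage can layer beat/onset/chorus-aware segmentation on top without overturning the existing flow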
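
To make the synthetic-audio check above easier to reason about, here is a minimal sketch of energy-based silence detection in plain Python. It is a simplified stand-in for `librosa.effects.split`, not its implementation: it uses non-overlapping frames, and `non_silent_intervals`, `frame_len`, and the 1 kHz sample rate are hypothetical choices made only for this illustration.

```python
import math


def non_silent_intervals(y, frame_len=100, top_db=30.0):
    """Return (start, end) sample ranges whose frame RMS is within top_db of the loudest frame."""
    # Frame-wise RMS over non-overlapping frames (a simplification; not how librosa frames audio).
    rms = []
    for i in range(0, len(y), frame_len):
        frame = y[i:i + frame_len]
        rms.append(math.sqrt(sum(v * v for v in frame) / len(frame)))
    peak = max(rms) if rms else 0.0
    if peak == 0.0:
        return []
    # A frame counts as non-silent if it is less than top_db below the loudest frame.
    threshold = peak * 10 ** (-top_db / 20.0)
    intervals = []
    start = None
    for idx, value in enumerate(rms):
        if value > threshold and start is None:
            start = idx * frame_len
        elif value <= threshold and start is not None:
            intervals.append((start, idx * frame_len))
            start = None
    if start is not None:
        intervals.append((start, len(y)))
    return intervals


# Hypothetical test signal: 4 s silence + 10 s tone + 4 s silence at a low sample rate.
sr = 1000
silence = [0.0] * (4 * sr)
tone = [math.sin(2 * math.pi * 50 * n / sr) for n in range(10 * sr)]
y = silence + tone + silence
print(non_silent_intervals(y, frame_len=100))  # [(4000, 14000)]
```

With `sr = 1000`, the interval `(4000, 14000)` corresponds to seconds 4 to 14, the tone-only body of the signal.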
......
......@@ -345,6 +345,61 @@ flowchart TD
---
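
## 11.5 Segmentation strategies: don't rely on random cropping alone

The project now supports four segmentation strategies, each with a different role:

| Strategy | Where it applies | Purpose | Integrated? |
|---|---|---|---|
| `random` | Training queries | Improves generalization by simulating unknown user crop points | Yes |
| `sliding` | Catalog building / query generation | Guarantees coverage and reduces missed recalls | Yes |
| `silence_aware` | Training queries / external query generation | Prefers segments with real musical content and avoids silence | Yes |
| `hybrid` | Training queries / external query generation | Mixes silence-aware and random selection to balance stability and generalization | Yes |

How to read this:

1. **Training is not all random cropping**
   The training set can use `random / silence_aware / hybrid`
2. **Reference catalog building is not random cropping**
   Catalog building still uses a fixed sliding window
3. **External-data query generation is not limited to random cropping either**
   `--query-strategy silence_aware` is now available

Why not rely entirely on musical-structure segmentation?

- Real ACR queries often come from short videos, screen recordings, and casual clips, which do not necessarily align with beat or section boundaries
- Doing **silence-aware segmentation** first gives the largest gain at the lowest risk
- More complex beat / chorus / onset segmentation can be added as a next-stage enhancement, but it should not replace the existing random augmentation

### Training-side recommendation
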
```bash
/usr/local/miniconda3/bin/python acr-engine/train.py \
--data data/your_manifests \
--segment-strategy hybrid \
--silence-top-db 30
```

Suggestions:

- Baseline: `random`
- Music tasks that need more stability: `hybrid`
- Source audio known to contain a lot of silence: `silence_aware`
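
The difference between the three training-side strategies can be sketched as follows. This is an illustrative reduction of the selection logic, not the project's code: it takes precomputed non-silent intervals (in samples) instead of calling `librosa`, and the function name `choose_offset` and the `silence_aware_prob` and `rng` parameters are assumptions introduced for this example. It also derives the crop length from `segment_len / sr`, which is an assumption of this sketch.

```python
import random


def choose_offset(intervals, sr, segment_len, duration,
                  strategy="hybrid", silence_aware_prob=0.7, rng=random):
    """Pick a crop start in seconds for one training segment."""
    max_offset = max(0.0, duration - segment_len / sr)
    if max_offset <= 0:
        return 0.0  # track is no longer than one segment
    # Only intervals long enough to hold a full segment are usable.
    valid = [(s, e) for s, e in intervals if e - s >= segment_len]
    if strategy == "silence_aware":
        use_intervals = True
    elif strategy == "hybrid":
        # Hybrid: mostly silence-aware, sometimes plain random.
        use_intervals = rng.random() < silence_aware_prob
    else:
        use_intervals = False
    if use_intervals and valid:
        start, end = rng.choice(valid)
        chosen = rng.randint(start, end - segment_len)  # inclusive bounds
        return min(chosen / sr, max_offset)
    # Fallback for "random", for the random share of "hybrid", and when no interval fits.
    return rng.uniform(0, max_offset)


rng = random.Random(0)
intervals = [(4000, 14000)]  # hypothetical: 4 s to 14 s is non-silent at sr=1000
offsets = [
    choose_offset(intervals, sr=1000, segment_len=5000, duration=18.0,
                  strategy="silence_aware", rng=rng)
    for _ in range(5)
]
print(all(4.0 <= o <= 9.0 for o in offsets))  # True
```

Keeping the random fallback in every branch means a track with no sufficiently long non-silent interval still yields a crop.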

### External-data query generation recommendation

```bash
/usr/local/miniconda3/bin/python acr-engine/src/data/external_adapters.py prepare-local fma data/raw/fma_small_audio \
--output-root data/external_ingested \
--query-duration 8 \
--query-stride 4 \
--query-strategy silence_aware \
--silence-top-db 30
```

This preferentially generates queries from non-silent regions instead of sampling at random from long silent heads and tails.
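
As a rough illustration of how `--query-stride` and `silence_aware` combine, the sketch below slides a window inside each non-silent interval and always includes the last position that still fits. It is a simplified model under stated assumptions: the intervals are given in samples rather than computed by `librosa.effects.split`, `query_stride` is assumed to be positive, and the helper name `sliding_offsets` is hypothetical.

```python
def sliding_offsets(intervals, sr, query_duration, query_stride, duration):
    """Return sorted query start times (seconds) inside non-silent intervals."""
    target_len = int(query_duration * sr)
    stride = int(query_stride * sr)  # assumed > 0
    offsets = []
    for start, end in intervals:
        if end - start < target_len:
            continue  # interval too short to hold one full query
        last_pos = end - target_len
        positions = list(range(start, last_pos + 1, stride))
        # Make sure the tail of the interval is covered as well.
        if positions[-1] != last_pos:
            positions.append(last_pos)
        offsets.extend(round(p / sr, 3) for p in positions)
    # Drop any offset whose window would run past the end of the track.
    max_offset = max(0.0, duration - query_duration)
    return sorted({o for o in offsets if o <= max_offset})


# Hypothetical track: non-silent from 4 s to 14 s, 18 s long, sr=1000.
print(sliding_offsets([(4000, 14000)], sr=1000, query_duration=5,
                      query_stride=2.5, duration=18.0))  # [4.0, 6.5, 9.0]
```

Every generated window lies entirely inside a non-silent interval, so none of the queries starts in the leading or trailing silence.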
---

## 12. Which of your internal material types are recommended for training

## 12.1 One-page summary
......