Commit b6cdf668 b6cdf668df34481116416b05c30c2a2756ebe7d4 by cnb.bofCdSsphPA

Bias music training crops toward salient energy and attack regions

Constraint: Music ACR queries should be closer to choruses, strong rhythmic sections, and attack regions without giving up the existing random and silence-aware fallbacks
Rejected: Add only heavier beat/chorus modeling first | higher complexity and more brittle than lightweight energy/onset heuristics for the current training pipeline
Confidence: high
Scope-risk: moderate
Directive: Keep high_energy/onset_aware as heuristic candidate generators; future beat/chorus logic should layer on top of them rather than replace the fallback stack
Tested: /usr/local/miniconda3/bin/python -m py_compile acr-engine/src/data/dataset.py acr-engine/src/data/manifest_tools.py acr-engine/train.py acr-engine/src/data/external_adapters.py; synthetic_v2 dry-run with --segment-strategy high_energy and onset_aware; handcrafted 20s audio fixture with high_energy/onset_aware query offset checks
Not-tested: Full retraining/evaluation impact on FMA or internal production datasets
1 parent 4ceaa995
......@@ -9,6 +9,61 @@ import torch
from torch.utils.data import Dataset
def compute_candidate_offsets(
    y: np.ndarray,
    sr: int,
    segment_len: int,
    strategy: str,
    silence_top_db: int,
) -> List[int]:
    if len(y) <= segment_len:
        return [0]
    if strategy == "silence_aware":
        intervals = librosa.effects.split(y, top_db=silence_top_db)
        if intervals is None or len(intervals) == 0:
            return []
        offsets = []
        for start, end in intervals:
            start = int(start)
            end = int(end)
            if end - start >= segment_len:
                offsets.append(start)
                last = end - segment_len
                if last > start:
                    offsets.append(last)
        return sorted(set(offsets))
    if strategy == "high_energy":
        hop = max(segment_len // 2, 1)
        scores: List[tuple[float, int]] = []
        for start in range(0, max(len(y) - segment_len + 1, 1), hop):
            seg = y[start : start + segment_len]
            if len(seg) < segment_len:
                seg = np.pad(seg, (0, segment_len - len(seg)))
            rms = float(np.sqrt(np.mean(np.square(seg)) + 1e-12))
            scores.append((rms, start))
        scores.sort(key=lambda x: x[0], reverse=True)
        return [start for _, start in scores[: min(6, len(scores))]]
    if strategy == "onset_aware":
        try:
            onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=512, units="frames")
            onset_samples = librosa.frames_to_samples(onset_frames, hop_length=512)
        except Exception:
            onset_samples = np.array([], dtype=int)
        if onset_samples.size == 0:
            return []
        offsets = []
        max_start = max(len(y) - segment_len, 0)
        for onset in onset_samples.tolist():
            start = max(0, min(int(onset), max_start))
            offsets.append(start)
        return sorted(set(offsets[: min(8, len(offsets))]))
    return []
class ACRDataset(Dataset):
def __init__(
self,
......@@ -74,15 +129,9 @@ class ACRDataset(Dataset):
)
return librosa.power_to_db(mel, ref=np.max)
def _find_non_silent_intervals(self, y: np.ndarray) -> List[tuple[int, int]]:
intervals = librosa.effects.split(y, top_db=self.silence_top_db)
if intervals is None or len(intervals) == 0:
return [(0, len(y))]
return [(int(s), int(e)) for s, e in intervals]
def _choose_offset(self, sample: Dict, audio_path: Path) -> float:
duration = float(sample["duration"])
max_offset = max(0.0, duration - 5.0)
max_offset = max(0.0, duration - (self.segment_len / self.sr))
if max_offset <= 0:
return 0.0
......@@ -90,26 +139,31 @@ class ACRDataset(Dataset):
return random.uniform(0, max_offset)
y, _ = librosa.load(str(audio_path), sr=self.sr, mono=True)
target_len = self.segment_len
intervals = self._find_non_silent_intervals(y)
valid_intervals = []
for start, end in intervals:
if end - start >= target_len:
valid_intervals.append((start, end))
if self.segment_strategy == "silence_aware":
if valid_intervals:
start, end = random.choice(valid_intervals)
seg_max_start = max(start, end - target_len)
chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
direct_candidates = compute_candidate_offsets(
y=y,
sr=self.sr,
segment_len=self.segment_len,
strategy=self.segment_strategy,
silence_top_db=self.silence_top_db,
)
if direct_candidates:
chosen = random.choice(direct_candidates)
return min(chosen / self.sr, max_offset)
return random.uniform(0, max_offset)
if self.segment_strategy == "hybrid":
if valid_intervals and random.random() < 0.7:
start, end = random.choice(valid_intervals)
seg_max_start = max(start, end - target_len)
chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
candidate_pool: List[int] = []
for strategy in ("high_energy", "onset_aware", "silence_aware"):
candidate_pool.extend(
compute_candidate_offsets(
y=y,
sr=self.sr,
segment_len=self.segment_len,
strategy=strategy,
silence_top_db=self.silence_top_db,
)
)
if candidate_pool and random.random() < 0.75:
chosen = random.choice(sorted(set(candidate_pool)))
return min(chosen / self.sr, max_offset)
return random.uniform(0, max_offset)
......@@ -260,24 +314,37 @@ class SongPairDataset(Dataset):
path = self.asset_root / sample["audio_path"]
full_y, _ = librosa.load(str(path), sr=self.sr, mono=True)
duration = float(sample.get("duration", len(full_y) / self.sr))
max_offset = max(0.0, duration - 5.0)
max_offset = max(0.0, duration - (self.segment_len / self.sr))
offset = 0.0
if max_offset > 0:
if self.segment_strategy == "random":
offset = random.uniform(0, max_offset)
else:
intervals = librosa.effects.split(full_y, top_db=self.silence_top_db)
valid = [(int(s), int(e)) for s, e in intervals if int(e) - int(s) >= self.segment_len] if len(intervals) else []
if self.segment_strategy == "silence_aware" and valid:
start, end = random.choice(valid)
seg_max_start = max(start, end - self.segment_len)
chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
offset = min(chosen / self.sr, max_offset)
elif self.segment_strategy == "hybrid" and valid and random.random() < 0.7:
start, end = random.choice(valid)
seg_max_start = max(start, end - self.segment_len)
chosen = random.randint(start, seg_max_start) if seg_max_start > start else start
offset = min(chosen / self.sr, max_offset)
direct_candidates = compute_candidate_offsets(
y=full_y,
sr=self.sr,
segment_len=self.segment_len,
strategy=self.segment_strategy,
silence_top_db=self.silence_top_db,
)
if direct_candidates:
offset = min(random.choice(direct_candidates) / self.sr, max_offset)
elif self.segment_strategy == "hybrid":
candidate_pool: List[int] = []
for strategy in ("high_energy", "onset_aware", "silence_aware"):
candidate_pool.extend(
compute_candidate_offsets(
y=full_y,
sr=self.sr,
segment_len=self.segment_len,
strategy=strategy,
silence_top_db=self.silence_top_db,
)
)
if candidate_pool and random.random() < 0.75:
offset = min(random.choice(sorted(set(candidate_pool))) / self.sr, max_offset)
else:
offset = random.uniform(0, max_offset)
else:
offset = random.uniform(0, max_offset)
start = int(offset * self.sr)
......
......@@ -516,7 +516,7 @@ def main():
p.add_argument("--eval-ratio", type=float, default=0.2)
p.add_argument("--query-duration", type=float, default=8.0)
p.add_argument("--query-stride", type=float, default=None)
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "high_energy", "onset_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--seed", type=int, default=42)
......@@ -548,8 +548,8 @@ def main():
p.add_argument("--eval-ratio", type=float, default=0.2)
p.add_argument("--query-duration", type=float, default=8.0)
p.add_argument("--query-stride", type=float, default=None)
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--segment-strategy", choices=["random", "silence_aware", "hybrid"], default="random")
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "high_energy", "onset_aware", "hybrid"], default="random")
p.add_argument("--segment-strategy", choices=["random", "silence_aware", "high_energy", "onset_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--index-checkpoint-every-refs", type=int, default=100)
p.add_argument("--seed", type=int, default=42)
......
......@@ -7,12 +7,19 @@ import csv
import json
import random
import shutil
import sys
from pathlib import Path
from typing import List, Dict
import numpy as np
import soundfile as sf
import librosa
ROOT = Path(__file__).resolve().parents[2]
if str(ROOT) not in sys.path:
sys.path.insert(0, str(ROOT))
from src.data.dataset import compute_candidate_offsets
def write_catalog(records: List[Dict], output_path: Path):
output_path.parent.mkdir(parents=True, exist_ok=True)
......@@ -62,34 +69,26 @@ def build_train_eval_from_audio_dir(
train = []
test = []
def compute_silence_aware_offsets(path: Path, duration: float) -> List[float]:
def compute_strategy_offsets(path: Path, duration: float, strategy: str) -> List[float]:
if duration < query_duration:
return []
try:
y, sr = librosa.load(str(path), sr=None, mono=True)
intervals = librosa.effects.split(y, top_db=silence_top_db)
if intervals is None or len(intervals) == 0:
raise ValueError("no_non_silent_intervals")
offsets = []
target_len = int(query_duration * sr)
for start, end in intervals:
candidates = compute_candidate_offsets(
y=y,
sr=sr,
segment_len=target_len,
strategy=strategy,
silence_top_db=silence_top_db,
)
offsets = []
for start in candidates:
start = int(start)
end = int(end)
if end - start < target_len:
continue
if query_stride and query_stride > 0:
stride = int(query_stride * sr)
local_positions = list(range(start, max(start + 1, end - target_len + 1), stride))
if not local_positions:
local_positions = [start]
last_pos = end - target_len
if last_pos >= start and local_positions[-1] != last_pos:
local_positions.append(last_pos)
offsets.extend([round(pos / sr, 3) for pos in local_positions])
if query_stride and query_stride > 0 and strategy in {"silence_aware"}:
offsets.append(round(start / sr, 3))
else:
seg_max_start = max(start, end - target_len)
chosen = rng.randint(start, seg_max_start) if seg_max_start > start else start
offsets.append(round(chosen / sr, 3))
offsets.append(round(start / sr, 3))
return sorted(set(x for x in offsets if x <= max(0.0, duration - query_duration)))
except Exception:
return []
......@@ -117,20 +116,23 @@ def build_train_eval_from_audio_dir(
refs.append(ref)
if duration >= query_duration:
if query_strategy in {"silence_aware", "hybrid"}:
silence_offsets = compute_silence_aware_offsets(path, duration)
else:
silence_offsets = []
if query_strategy == "silence_aware" and silence_offsets:
offsets = silence_offsets
elif query_strategy == "hybrid" and silence_offsets:
strategy_offsets = []
if query_strategy in {"silence_aware", "high_energy", "onset_aware"}:
strategy_offsets = compute_strategy_offsets(path, duration, query_strategy)
elif query_strategy == "hybrid":
for strategy in ("high_energy", "onset_aware", "silence_aware"):
strategy_offsets.extend(compute_strategy_offsets(path, duration, strategy))
strategy_offsets = sorted(set(strategy_offsets))
if query_strategy in {"silence_aware", "high_energy", "onset_aware"} and strategy_offsets:
offsets = strategy_offsets
elif query_strategy == "hybrid" and strategy_offsets:
if query_stride and query_stride > 0:
offsets = silence_offsets
offsets = strategy_offsets
else:
max_offset = max(0.0, duration - query_duration)
random_offset = round(rng.uniform(0.0, max_offset) if max_offset > 0 else 0.0, 3)
offsets = sorted(set(silence_offsets + [random_offset]))
offsets = sorted(set(strategy_offsets + [random_offset]))
elif query_stride and query_stride > 0:
max_offset = max(0.0, duration - query_duration)
offsets = [round(x, 3) for x in np.arange(0.0, max_offset + 1e-9, query_stride).tolist()]
......@@ -275,7 +277,7 @@ def main():
p.add_argument("--eval-ratio", type=float, default=0.2)
p.add_argument("--query-duration", type=float, default=8.0)
p.add_argument("--query-stride", type=float, default=None)
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "hybrid"], default="random")
p.add_argument("--query-strategy", choices=["random", "sliding", "silence_aware", "high_energy", "onset_aware", "hybrid"], default="random")
p.add_argument("--silence-top-db", type=int, default=30)
p.add_argument("--seed", type=int, default=42)
......
......@@ -125,7 +125,7 @@ def main():
parser.add_argument("--epochs", type=int, default=None)
parser.add_argument("--batch-size", type=int, default=None)
parser.add_argument("--lr", type=float, default=None)
parser.add_argument("--segment-strategy", choices=["random", "silence_aware", "hybrid"], default="random")
parser.add_argument("--segment-strategy", choices=["random", "silence_aware", "high_energy", "onset_aware", "hybrid"], default="random")
parser.add_argument("--silence-top-db", type=int, default=30)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
......
......@@ -5522,3 +5522,50 @@
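Conclusion:
- `smoke-local` now supports safe automatic recovery: runs can resume, but embeddings from an old model are never reused by mistake
- This matters most for long CPU-bound jobs such as real FMA runs: a restart resumes where it left off, and switching models does not contaminate the index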
### Stage: high-energy / onset-aware music segmentation
Completed:
- `acr-engine/src/data/dataset.py`: added training-crop candidate strategies:
  - `high_energy`
  - `onset_aware`
- `acr-engine/src/data/manifest_tools.py`: added external query generation strategies:
  - `--query-strategy high_energy`
  - `--query-strategy onset_aware`
- `hybrid` now reuses three kinds of music-aware candidates, then adds a random fallback:
  - `high_energy`
  - `onset_aware`
  - `silence_aware`
- `train.py` and `external_adapters.py` expose the new strategy options
- [docs/training-data-and-pgvector-guide.md](./training-data-and-pgvector-guide.md): added strategy descriptions and usage recommendations
Verification results:
- Compile check:
  - `/usr/local/miniconda3/bin/python -m py_compile src/data/dataset.py src/data/manifest_tools.py train.py src/data/external_adapters.py`
- Synthetic audio check:
  - Built a `20s` audio clip:
    - `4-6s`: low-energy tone
    - `8/10/12s`: strong attack pulses
    - `14-19s`: high-energy tone
  - Query generation results:
    - `high_energy` offsets:
      - `2.5, 7.5, 10.0, 12.5, 15.0`
    - `onset_aware` offsets:
      - `4.032, 6.048, 8.032, 10.016, 10.048, 12.032`
  - Training-side offset check:
    - `TRAIN_HIGH_ENERGY_OFFSETS`
      - `2.5, 15.0, 15.0, 2.5, 10.0, 12.5`
    - `TRAIN_ONSET_OFFSETS`
      - `4.064, 4.032, 10.016, 8.032, 8.032, 6.048`
  - This shows the new strategies are clearly biased toward high-energy regions or onset neighborhoods rather than sampling purely at random
- Dry-run check:
  - `train.py --data data/synthetic_v2 --dry-run --segment-strategy high_energy`
  - forward/backward succeeded, `Embedding shape: torch.Size([64, 192])`
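The synthetic audio check can be approximated with NumPy alone. The sketch below is only an illustration, not the fixture used in this commit: the sample rate, tone frequencies, amplitudes, pulse width, and 5 s window are all assumptions, and `make_fixture` and `rank_windows_by_rms` are hypothetical helper names. The ranking step follows the `high_energy` branch of `compute_candidate_offsets` (half-window hop, RMS score, top 6).

```python
import numpy as np


def make_fixture(sr: int = 8000) -> np.ndarray:
    """Build a 20 s mono clip: quiet tone at 4-6 s, pulses at 8/10/12 s, loud tone at 14-19 s.

    All amplitudes and frequencies are assumed values chosen for illustration.
    """
    t = np.arange(20 * sr) / sr
    y = np.zeros(len(t), dtype=np.float32)
    quiet = (t >= 4) & (t < 6)
    loud = (t >= 14) & (t < 19)
    y[quiet] = 0.05 * np.sin(2 * np.pi * 220 * t[quiet])
    y[loud] = 0.8 * np.sin(2 * np.pi * 440 * t[loud])
    for sec in (8, 10, 12):
        i = sec * sr
        y[i : i + int(0.02 * sr)] = 1.0  # 20 ms full-scale pulse as a crude "attack"
    return y


def rank_windows_by_rms(y: np.ndarray, segment_len: int, top_k: int = 6) -> list:
    """Return window start samples sorted by descending RMS, keeping the top_k."""
    hop = max(segment_len // 2, 1)
    scores = []
    for start in range(0, max(len(y) - segment_len + 1, 1), hop):
        seg = y[start : start + segment_len]
        rms = float(np.sqrt(np.mean(np.square(seg)) + 1e-12))
        scores.append((rms, start))
    scores.sort(key=lambda x: x[0], reverse=True)
    return [start for _, start in scores[:top_k]]


sr = 8000
ranked = rank_windows_by_rms(make_fixture(sr), segment_len=5 * sr)
print([s / sr for s in ranked])
```

Under these assumed amplitudes, the window starting at 15.0 s ranks first because it overlaps the loud tone the longest. The exact offsets logged in this entry depend on the real fixture and need not match this sketch.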
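Conclusion:
- The project's music-aware cropping has expanded from "avoid silence" to "favor main sections / favor onsets"
- Further improvements can be layered on top of this:
  - beat-aware
  - chorus-aware
  - repeated-section-aware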
......
......@@ -354,12 +354,14 @@ flowchart TD
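| `random` | Training queries | Improves generalization; simulates unknown user capture points | Yes |
| `sliding` | Index building / query generation | Guarantees coverage; reduces missed recalls | Yes |
| `silence_aware` | Training queries / external query generation | Avoids silence first, landing on segments with actual musical content | Yes |
| `high_energy` | Training queries / external query generation | Prefers high-RMS regions, closer to choruses / lead vocals / strongly rhythmic sections | Yes |
| `onset_aware` | Training queries / external query generation | Prefers positions near onset events; fewer crops that land on decay tails or empty beats | Yes |
| `hybrid` | Training queries / external query generation | Mixes silence-aware + random; balances stability and generalization | Yes |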
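How to read this:
1. **Training crops are not all random**
   The training set currently supports `random / silence_aware / hybrid`
   The training set currently supports `random / silence_aware / high_energy / onset_aware / hybrid`
2. **Reference indexing does not use random crops**
   Index building still uses a fixed sliding window
3. **External-data query generation is not limited to random crops either**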
......@@ -384,6 +386,8 @@ flowchart TD
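- Baseline: `random`
- More robust for music tasks: `hybrid`
- Source audio known to contain a lot of silence: `silence_aware`
- To stay closer to choruses / strong rhythm: `high_energy`
- To stay closer to short-note onsets / hits: `onset_aware`
### Recommended external-data query generation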
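
For `onset_aware`, candidate starts are onset frame indices converted to samples and clamped so that a full segment still fits. The sketch below shows that arithmetic without librosa. With `n_fft` unset, `librosa.frames_to_samples` is just `frames * hop_length`. The frame indices, the 16 kHz sample rate, and the 5 s segment length are assumptions chosen to illustrate why logged offsets such as `4.032` are multiples of `512 / sr`. `onset_frames_to_offsets` is a hypothetical helper, not a function in the repository.

```python
import numpy as np


def onset_frames_to_offsets(onset_frames, total_len, segment_len,
                            hop_length=512, max_candidates=8):
    """Convert onset frame indices to clamped, de-duplicated start samples."""
    onset_samples = np.asarray(onset_frames, dtype=int) * hop_length
    max_start = max(total_len - segment_len, 0)
    starts = [max(0, min(int(s), max_start)) for s in onset_samples.tolist()]
    # Keep only the first few onsets, as the onset_aware branch does.
    return sorted(set(starts[:max_candidates]))


sr = 16000  # assumed sample rate
frames = [126, 189, 251, 313, 314, 376]  # hypothetical detected onset frames
offsets = onset_frames_to_offsets(frames, total_len=20 * sr, segment_len=5 * sr)
print([round(o / sr, 3) for o in offsets])
```

Because frame 126 maps to sample 64512, the first offset is 64512 / 16000 = 4.032 s, slightly after the 4 s mark instead of exactly on it.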
......@@ -392,11 +396,20 @@ flowchart TD
--output-root data/external_ingested \
--query-duration 8 \
--query-stride 4 \
--query-strategy silence_aware \
--query-strategy high_energy \
--silence-top-db 30
```
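This generates queries from non-silent regions first, instead of sampling at random from long silent intros and outros.
This generates queries from high-energy regions first, instead of sampling at random from long silent intros/outros or low-energy interludes.
Additional recommendations:
| Scenario | Recommended strategy |
|---|---|
| Recordings with long silent intros/outros | `silence_aware` |
| Closer to choruses / main sections | `high_energy` |
| Closer to hits / vocal entry points | `onset_aware` |
| Music-aware while preserving generalization | `hybrid` |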
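
The way `hybrid` balances music-aware candidates against random sampling can be sketched as follows. This is a simplified illustration of the pooling logic in `_choose_offset`, not the repository code: `choose_hybrid_offset` and its parameters are hypothetical names, and the per-strategy candidate lists are passed in directly instead of being computed from audio. The 0.75 probability matches the value used in this commit.

```python
import random
from typing import Dict, Sequence


def choose_hybrid_offset(
    candidates_by_strategy: Dict[str, Sequence[int]],
    sr: int,
    max_offset: float,
    rng: random.Random,
    pool_prob: float = 0.75,
) -> float:
    """Pick a crop start in seconds from pooled candidates, or fall back to uniform random."""
    # Merge candidate start samples from all strategies and drop duplicates.
    pool = sorted({int(c) for cands in candidates_by_strategy.values() for c in cands})
    if pool and rng.random() < pool_prob:
        # Clamp so the segment never runs past the end of the track.
        return min(rng.choice(pool) / sr, max_offset)
    return rng.uniform(0.0, max_offset)


rng = random.Random(42)
offset = choose_hybrid_offset(
    {"high_energy": [240000, 200000], "onset_aware": [64512], "silence_aware": []},
    sr=16000,
    max_offset=15.0,
    rng=rng,
)
print(offset)
```

Keeping a nonzero chance of the uniform fallback means the model still sees crops outside the heuristic candidates, which is the generalization half of the trade-off.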
---
......