Commit 2d501547 2d5015474d13a959055855285c1df8cb66847bc4 by cnb.bofCdSsphPA

Keep raw open-music assets out of normal git history

Constraint: The user wants real datasets added locally and potentially pushed, which would make ordinary git history fragile without LFS guardrails
Rejected: Download first and retrofit tracking later | Risks oversized commits and inconsistent reproducibility rules
Confidence: high
Scope-risk: narrow
Directive: Route all future raw corpus archives and audio under acr-engine/data/raw through LFS unless a smaller manifest-only alternative is explicitly chosen
Tested: git lfs version; git check-attr filter -- acr-engine/data/raw/fma_small_audio/example.wav; git check-attr filter -- acr-engine/data/raw/archive.zip
Not-tested: Actual large-file add/push against remote LFS storage remains pending until real dataset files are downloaded
1 parent f2360135
acr-engine/data/raw/**/*.zip filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.tar filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.tar.gz filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.tgz filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.wav filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.mp3 filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.flac filter=lfs diff=lfs merge=lfs -text
acr-engine/data/raw/**/*.ogg filter=lfs diff=lfs merge=lfs -text
# Local Open-Music Drop Zones
# Raw Open Music Drop Zones
Put real downloaded open-music audio files here before running the one-shot smoke flow.
## One-screen summary
## Recommended folders
- `data/raw/fma_small_audio/`
- `data/raw/mtg_jamendo_audio/`
| Dataset | Local directory | Primary use here | Current status | License note |
|---|---|---|---|---|
| FMA Small | [fma_small_audio/](./fma_small_audio/) | first real-data smoke + training baseline | not downloaded | track-level terms vary; verify before broader use |
| MTG-Jamendo | [mtg_jamendo_audio/](./mtg_jamendo_audio/) | retrieval/eval smoke and corpus compatibility checks | not downloaded | common usage path is research-oriented; do **not** treat as commercial-ready by default |
## Recommended order
```mermaid
flowchart LR
A[Download / place local audio] --> B[check-local-ready]
B --> C[inspect-local]
C --> D[smoke-local]
```
## Minimal commands
### 1) Verify the folder is actually usable
## Next command
For FMA:
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
/usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
/usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready mtg_jamendo data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
```
For MTG-Jamendo:
### 2) Run the full smoke pipeline
```bash
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
/usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
```
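The two steps above can also be driven from a small wrapper so that the readiness check gates the smoke run. The following is a minimal sketch, not part of the repository: `build_commands` and `run_gated` are hypothetical helper names, `sys.executable` stands in for the interpreter path used above, and only flags that appear in the commands above are used. It assumes `check-local-ready` exits non-zero when a folder is not usable, which is not confirmed here.

```python
import subprocess
import sys

ADAPTER = "src/data/external_adapters.py"  # script path taken from the commands above


def build_commands(dataset, raw_dir, python=sys.executable):
    """Return the readiness-check and smoke commands, in the order they should run."""
    common = ["--eval-ratio", "0.2", "--query-duration", "8.0"]
    check = [python, ADAPTER, "check-local-ready", dataset, raw_dir, *common]
    smoke = [
        python, ADAPTER, "smoke-local", dataset, raw_dir,
        "--output-root", "data/external_smoke", *common,
        "--train-epochs", "1", "--batch-size", "2",
    ]
    return [check, smoke]


def run_gated(dataset, raw_dir):
    """Run each command in order and stop at the first failure.

    Assumption: a failed readiness check is reported through a non-zero exit code.
    """
    for cmd in build_commands(dataset, raw_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_gated("fma", "data/raw/fma_small_audio")` from the `acr-engine/` directory would then run the check first and only start the smoke pipeline if the check succeeded.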
## Git LFS policy
Large raw archives and audio under `data/raw/` are tracked through Git LFS via the repository-root [/.gitattributes](../../../.gitattributes).
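To confirm that a given path really is routed through LFS, `git check-attr filter -- <path>` can be queried and its output parsed. The sketch below is illustrative only: `parse_check_attr` and `lfs_filter_status` are hypothetical helper names, and the only external behavior relied on is the `<path>: <attribute>: <value>` line format that `git check-attr` prints.

```python
import subprocess


def parse_check_attr(output):
    """Parse `git check-attr` output lines of the form `<path>: <attribute>: <value>`."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split from the right so a path that itself contains ": " stays intact.
        path, _attr, value = line.rsplit(": ", 2)
        result[path] = value
    return result


def lfs_filter_status(paths):
    """Ask git which `filter` attribute applies to each path (needs a git checkout)."""
    proc = subprocess.run(
        ["git", "check-attr", "filter", "--", *paths],
        capture_output=True, text=True, check=True,
    )
    return parse_check_attr(proc.stdout)
```

Run from the repository root, `lfs_filter_status(["acr-engine/data/raw/archive.zip"])` should map the path to `lfs` when the attribute applies; git reports `unspecified` for paths with no `filter` attribute.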
## Download / ingestion policy
1. Prefer **small, verifiable subsets first**.
2. Keep original archives under `data/raw/` only when they are truly needed for reproducibility.
3. For MTG-Jamendo, keep the corpus in an **evaluation/research** lane unless a verified commercial-safe subset is explicitly documented.
4. Record source URL, subset choice, and license evidence in docs before broadening training scope.
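Point 4 can be made concrete with a small provenance record kept next to the docs. This is a sketch under stated assumptions: the `DatasetProvenance` class, its field names, and the JSON layout are hypothetical and not an existing convention of this repository, and the values in the usage note are placeholders to be filled in by hand.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DatasetProvenance:
    """Hypothetical record of where a raw dataset came from and how it may be used."""
    dataset: str
    source_url: str
    subset: str
    license_evidence: str
    lane: str  # assumed labels, e.g. "engineering" or "research-eval"


def to_json(record):
    """Serialise a provenance record as stable, diff-friendly JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

For example, `to_json(DatasetProvenance("mtg_jamendo", "<source URL>", "<subset choice>", "<license evidence>", "research-eval"))` yields a JSON document that can be reviewed in a normal diff before the training scope is widened.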
......
......@@ -222,6 +222,31 @@
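- Handoff information is better suited to automation and long-term continuous development
### Stage: Raw open-data LFS governance
Completed:
- Added a root-level [/.gitattributes](../.gitattributes)
- Configured a Git LFS tracking policy for large files and audio under `acr-engine/data/raw/`
- Rewrote [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md), adding:
  - a table of drop-zone responsibilities
  - the shortest `check-local-ready -> smoke-local` path
  - the raw-data download / LFS governance policy
- Extended [docs/dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md) with download / LFS governance notes
Verification results:
- `git lfs version` succeeded
- `git check-attr filter -- acr-engine/data/raw/fma_small_audio/example.wav` returned `filter: lfs`
- `git check-attr filter -- acr-engine/data/raw/archive.zip` returned `filter: lfs`
- The docs now explicitly distinguish:
  - engineering usability
  - benchmark suitability
  - commercial deployability
Conclusions:
- The repository now has the LFS infrastructure needed to take in real open audio and archives
- When real data is downloaded later, large files will not land directly in ordinary git history
### Stage: Real-data readiness gate
Completed: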
......
......@@ -96,3 +96,35 @@ flowchart LR
## Sources
- See [references-and-sources.md](./references-and-sources.md) for the current source map.
## Download / LFS governance
### Preferred repository behavior
```mermaid
flowchart TD
A[Upstream dataset source] --> B[Local raw drop zone]
B --> C[Git LFS tracked large files]
C --> D[check-local-ready]
D --> E[prepare-local / smoke-local]
```
### Current repo policy
| Item | Policy | Reason |
|---|---|---|
| `acr-engine/data/raw/**/*.zip` | Git LFS | avoid bloating normal git history |
| `acr-engine/data/raw/**/*.wav` / `.mp3` / `.flac` / `.ogg` | Git LFS | allow local reproducibility without normal blob explosion |
| FMA Small | acceptable as first real-data engineering baseline | easiest realistic open music smoke path |
| MTG-Jamendo | default to research/eval lane | do not assume commercial-safe rights without subset-specific proof |
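As a quick pre-commit sanity check, the LFS rows of this table can be approximated in plain Python. This is a rough sketch, not git's own pattern matching: `expected_lfs` is a hypothetical helper, and the suffix list mirrors the `acr-engine/data/raw/**/*` entries in `.gitattributes`, including the tar variants that the table does not list. `git check-attr` remains the authoritative answer.

```python
from pathlib import PurePosixPath

RAW_ROOT = PurePosixPath("acr-engine/data/raw")
# Suffixes copied from the .gitattributes patterns; matching is case-sensitive here.
LFS_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".wav", ".mp3", ".flac", ".ogg")


def expected_lfs(path):
    """Approximate whether a repo-relative path should be routed through Git LFS."""
    p = PurePosixPath(path)
    try:
        # Only files under the raw drop zone are covered by the policy.
        p.relative_to(RAW_ROOT)
    except ValueError:
        return False
    return p.name.endswith(LFS_SUFFIXES)
```

A path for which `expected_lfs` returns `True` while `git check-attr filter` reports `unspecified` would suggest that the attributes file and this policy table have drifted apart.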
### Operational note
Even when a dataset is technically downloadable, this project should separate:
- **engineering usability**
- **benchmark suitability**
- **commercial deployment suitability**
These are not the same thing.
......