Commit 2d501547 2d5015474d13a959055855285c1df8cb66847bc4 by cnb.bofCdSsphPA

Keep raw open-music assets out of normal git history

Constraint: The user wants real datasets added locally and potentially pushed, which would make ordinary git history fragile without LFS guardrails
Rejected: Download first and retrofit tracking later | Risks oversized commits and inconsistent reproducibility rules
Confidence: high
Scope-risk: narrow
Directive: Route all future raw corpus archives and audio under acr-engine/data/raw through LFS unless a smaller manifest-only alternative is explicitly chosen
Tested: git lfs version; git check-attr filter -- acr-engine/data/raw/fma_small_audio/example.wav; git check-attr filter -- acr-engine/data/raw/archive.zip
Not-tested: Actual large-file add/push against remote LFS storage remains pending until real dataset files are downloaded
1 parent f2360135
1 acr-engine/data/raw/**/*.zip filter=lfs diff=lfs merge=lfs -text
2 acr-engine/data/raw/**/*.tar filter=lfs diff=lfs merge=lfs -text
3 acr-engine/data/raw/**/*.tar.gz filter=lfs diff=lfs merge=lfs -text
4 acr-engine/data/raw/**/*.tgz filter=lfs diff=lfs merge=lfs -text
5 acr-engine/data/raw/**/*.wav filter=lfs diff=lfs merge=lfs -text
6 acr-engine/data/raw/**/*.mp3 filter=lfs diff=lfs merge=lfs -text
7 acr-engine/data/raw/**/*.flac filter=lfs diff=lfs merge=lfs -text
8 acr-engine/data/raw/**/*.ogg filter=lfs diff=lfs merge=lfs -text
1 # Local Open-Music Drop Zones 1 # Raw Open Music Drop Zones
2 2
3 Put real downloaded open-music audio files here before running the one-shot smoke flow. 3 ## One-screen summary
4 4
5 ## Recommended folders 5 | Dataset | Local directory | Primary use here | Current status | License note |
6 - `data/raw/fma_small_audio/` 6 |---|---|---|---|---|
7 - `data/raw/mtg_jamendo_audio/` 7 | FMA Small | [fma_small_audio/](./fma_small_audio/) | first real-data smoke + training baseline | not downloaded | track-level terms vary; verify before broader use |
8 | MTG-Jamendo | [mtg_jamendo_audio/](./mtg_jamendo_audio/) | retrieval/eval smoke and corpus compatibility checks | not downloaded | common usage path is research-oriented; do **not** treat as commercial-ready by default |
9
10 ## Recommended order
11
12 ```mermaid
13 flowchart LR
14 A[Download / place local audio] --> B[check-local-ready]
15 B --> C[inspect-local]
16 C --> D[smoke-local]
17 ```
18
19 ## Minimal commands
20
21 ### 1) Verify the folder is actually usable
8 22
9 ## Next command
10 For FMA:
11 ```bash 23 ```bash
12 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 24 /usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0
25 /usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready mtg_jamendo data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0
13 ``` 26 ```
14 27
15 For MTG-Jamendo: 28 ### 2) Run the full smoke pipeline
29
16 ```bash 30 ```bash
31 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
17 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 32 /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2
18 ``` 33 ```
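The two steps above can also be driven from a small script so that the smoke pipeline only runs once the readiness check has passed. The following is a minimal sketch, not part of the repository: the helper names `build_adapter_command` and `run_in_order` are hypothetical, and defaulting to `sys.executable` instead of `/usr/local/miniconda3/bin/python` is an assumption. The script path, subcommands, and flags are copied from the commands shown above.

```python
import subprocess
import sys

# Script path as it appears in the README commands.
ADAPTER_SCRIPT = "src/data/external_adapters.py"


def build_adapter_command(step, dataset, audio_dir, python=sys.executable,
                          eval_ratio=0.2, query_duration=8.0,
                          output_root="data/external_smoke",
                          train_epochs=1, batch_size=2):
    """Build the argv list for one adapter step (hypothetical helper).

    Only the two subcommands used in the README are accepted here.
    """
    if step not in ("check-local-ready", "smoke-local"):
        raise ValueError(f"unsupported step: {step}")
    cmd = [python, ADAPTER_SCRIPT, step, dataset, audio_dir]
    if step == "smoke-local":
        cmd += ["--output-root", output_root]
    cmd += ["--eval-ratio", str(eval_ratio),
            "--query-duration", str(query_duration)]
    if step == "smoke-local":
        cmd += ["--train-epochs", str(train_epochs),
                "--batch-size", str(batch_size)]
    return cmd


def run_in_order(dataset, audio_dir):
    """Run check-local-ready first; smoke-local runs only if it exits with 0."""
    for step in ("check-local-ready", "smoke-local"):
        # check=True raises CalledProcessError on a non-zero exit code,
        # which stops the loop before the more expensive smoke run.
        subprocess.run(build_adapter_command(step, dataset, audio_dir), check=True)
```

For example, `run_in_order("fma", "data/raw/fma_small_audio")` would issue the same two FMA commands as above, assuming the script signals an unusable folder through a non-zero exit code.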
34
35 ## Git LFS policy
36
37 Large raw archives and audio under `data/raw/` are tracked through Git LFS via [/.gitattributes](../../.gitattributes).
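Whether a given file is actually routed through LFS can be confirmed with `git check-attr`, the same check listed in the commit's `Tested:` line. The sketch below wraps that call; the function names `parse_check_attr` and `lfs_filter_values` are hypothetical, and it assumes `git` is on the `PATH` and that paths are given relative to `repo_root`.

```python
import subprocess


def parse_check_attr(output):
    """Parse `git check-attr` output lines of the form '<path>: <attribute>: <value>'.

    Returns a dict mapping each path to the reported attribute value.
    """
    values = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split from the right so a path containing ': ' stays intact.
        path, _attribute, value = line.rsplit(": ", 2)
        values[path] = value
    return values


def lfs_filter_values(paths, repo_root="."):
    """Ask git which `filter` attribute applies to each path (hypothetical helper)."""
    proc = subprocess.run(
        ["git", "check-attr", "filter", "--", *paths],
        cwd=repo_root, capture_output=True, text=True, check=True,
    )
    return parse_check_attr(proc.stdout)
```

A path covered by the LFS rules should report `lfs`; a path with no matching rule reports `unspecified`.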
38
39 ## Download / ingestion policy
40
41 1. Prefer **small, verifiable subsets first**.
42 2. Keep original archives under `data/raw/` only when they are truly needed for reproducibility.
43 3. For MTG-Jamendo, keep the corpus in an **evaluation/research** lane unless a verified commercial-safe subset is explicitly documented.
44 4. Record source URL, subset choice, and license evidence in docs before broadening training scope.
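Item 4 can be made checkable by keeping one small provenance record per dataset. The sketch below is illustrative only: the `DatasetProvenance` class, its field names, the `lane` values, and the placeholder URL are all assumptions, not an existing schema in this repository.

```python
from dataclasses import dataclass


@dataclass
class DatasetProvenance:
    """Hypothetical provenance record for one raw dataset drop zone."""
    name: str
    source_url: str = ""
    subset: str = ""
    license_evidence: str = ""
    lane: str = "research"  # assumed values: "research" or "training"


def missing_fields(record):
    """Return the names of the policy fields that are still empty."""
    required = ("source_url", "subset", "license_evidence")
    return [field for field in required if not getattr(record, field).strip()]


def may_broaden_training(record):
    """Training scope may broaden only once every required field is recorded."""
    return not missing_fields(record)


# Placeholder values; the URL is deliberately not a real dataset location.
fma = DatasetProvenance(
    name="FMA Small",
    source_url="https://example.org/placeholder",
    subset="fma_small",
)
```

In this example `missing_fields(fma)` returns `["license_evidence"]`, so the record would not yet justify broadening the training scope.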
......
...@@ -222,6 +222,31 @@ ...@@ -222,6 +222,31 @@
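222 - Handoff information is better suited to automation and long-term continuous development 222 - Handoff information is better suited to automation and long-term continuous development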
223 223
224 224
225
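226 ### Stage: Raw open-data LFS governance
227
228 Completed:
229 - Added a root-level [/.gitattributes](../.gitattributes)
230 - Configured a Git LFS tracking policy for large files and audio under `acr-engine/data/raw/`
231 - Rewrote [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md), adding:
232 - a table of drop-zone responsibilities
233 - the shortest `check-local-ready -> smoke-local` path
234 - the raw-data download / LFS governance policy
235 - Added download / LFS governance notes to [docs/dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md)
236
237 Verification results:
238 - `git lfs version` succeeded
239 - `git check-attr filter -- acr-engine/data/raw/fma_small_audio/example.wav` returned `filter: lfs`
240 - `git check-attr filter -- acr-engine/data/raw/archive.zip` returned `filter: lfs`
241 - The docs now explicitly distinguish:
242 - engineering usability
243 - benchmark suitability
244 - commercial deployment suitability
245
246 Conclusions:
247 - The repository now has the LFS infrastructure needed to accept real open audio and archives
248 - When real data is downloaded later, large files will not land directly in normal git history
249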
225 ### Stage: Real-data readiness gate 250 ### Stage: Real-data readiness gate
226 251
227 Completed: 252 Completed:
......
...@@ -96,3 +96,35 @@ flowchart LR ...@@ -96,3 +96,35 @@ flowchart LR
96 96
97 ## Sources 97 ## Sources
98 - See [references-and-sources.md](./references-and-sources.md) for the current source map. 98 - See [references-and-sources.md](./references-and-sources.md) for the current source map.
99
100
101 ## Download / LFS governance
102
103 ### Preferred repository behavior
104
105 ```mermaid
106 flowchart TD
107 A[Upstream dataset source] --> B[Local raw drop zone]
108 B --> C[Git LFS tracked large files]
109 C --> D[check-local-ready]
110 D --> E[prepare-local / smoke-local]
111 ```
112
113 ### Current repo policy
114
115 | Item | Policy | Reason |
116 |---|---|---|
117 | `acr-engine/data/raw/**/*.zip` | Git LFS | avoid bloating normal git history |
118 | `acr-engine/data/raw/**/*.wav` / `.mp3` / `.flac` / `.ogg` | Git LFS | allow local reproducibility without normal blob explosion |
119 | FMA Small | acceptable as first real-data engineering baseline | easiest realistic open music smoke path |
120 | MTG-Jamendo | default to research/eval lane | do not assume commercial-safe rights without subset-specific proof |
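The path-based rows of this policy can be approximated in code, for example to flag files before they are staged. The sketch below mirrors the eight suffixes listed in `.gitattributes` for this commit; the function name `expected_lfs` is hypothetical, and the suffix test is only an approximation of gitattributes pattern matching, so `git check-attr` remains the authoritative check.

```python
from pathlib import PurePosixPath

# Directory and suffixes taken from the .gitattributes rules in this commit.
LFS_ROOT = PurePosixPath("acr-engine/data/raw")
LFS_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".wav", ".mp3", ".flac", ".ogg")


def expected_lfs(path):
    """Return True if a repo-relative path should be LFS-tracked under this policy.

    Approximation: the file must sit anywhere below LFS_ROOT and its name
    must end with one of the listed suffixes (case-sensitive).
    """
    candidate = PurePosixPath(path)
    try:
        candidate.relative_to(LFS_ROOT)
    except ValueError:
        # The path is outside acr-engine/data/raw, so no rule applies.
        return False
    return candidate.name.endswith(LFS_SUFFIXES)
```

Under this sketch, `acr-engine/data/raw/archive.zip` is expected to be LFS-tracked, while the README in the same directory and a `.zip` file elsewhere in the repository are not.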
121
122 ### Operational note
123
124 Even when a dataset is technically downloadable, this project should separate:
125
126 - **engineering usability**
127 - **benchmark suitability**
128 - **commercial deployment suitability**
129
130 These are not the same thing.
......