Keep raw open-music assets out of normal git history

Constraint: The user wants real datasets added locally and potentially pushed, which would make ordinary git history fragile without LFS guardrails
Rejected: Download first and retrofit tracking later | Risks oversized commits and inconsistent reproducibility rules
Confidence: high
Scope-risk: narrow
Directive: Route all future raw corpus archives and audio under acr-engine/data/raw through LFS unless a smaller manifest-only alternative is explicitly chosen
Tested: git lfs version; git check-attr filter -- acr-engine/data/raw/fma_small_audio/example.wav; git check-attr filter -- acr-engine/data/raw/archive.zip
Not-tested: Actual large-file add/push against remote LFS storage remains pending until real dataset files are downloaded

Showing 4 changed files with 100 additions and 9 deletions
.gitattributes
0 → 100644
| 1 | acr-engine/data/raw/**/*.zip filter=lfs diff=lfs merge=lfs -text | ||
| 2 | acr-engine/data/raw/**/*.tar filter=lfs diff=lfs merge=lfs -text | ||
| 3 | acr-engine/data/raw/**/*.tar.gz filter=lfs diff=lfs merge=lfs -text | ||
| 4 | acr-engine/data/raw/**/*.tgz filter=lfs diff=lfs merge=lfs -text | ||
| 5 | acr-engine/data/raw/**/*.wav filter=lfs diff=lfs merge=lfs -text | ||
| 6 | acr-engine/data/raw/**/*.mp3 filter=lfs diff=lfs merge=lfs -text | ||
| 7 | acr-engine/data/raw/**/*.flac filter=lfs diff=lfs merge=lfs -text | ||
| 8 | acr-engine/data/raw/**/*.ogg filter=lfs diff=lfs merge=lfs -text |
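
The rules above can be checked before any real dataset is downloaded, in the same way as the `Tested:` line of the commit message. The following is a minimal Python sketch, assuming it runs inside the repository with `git` on `PATH`. The helper names `parse_check_attr` and `lfs_filter_for` are illustrative and not part of this commit.

```python
from __future__ import annotations

import subprocess


def parse_check_attr(output: str) -> dict[str, str]:
    """Parse `git check-attr <attr> -- <paths>` output into {path: value}."""
    result: dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Each line has the form "<path>: <attribute>: <value>".
        path, _attr, value = line.rsplit(": ", 2)
        result[path] = value
    return result


def lfs_filter_for(paths: list[str], repo_root: str = ".") -> dict[str, str]:
    """Ask git which `filter` attribute applies to each path.

    The paths do not need to exist on disk; git only evaluates the patterns.
    """
    proc = subprocess.run(
        ["git", "check-attr", "filter", "--", *paths],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_check_attr(proc.stdout)
```

Calling `lfs_filter_for(["acr-engine/data/raw/archive.zip"])` from the repository root should report `lfs` for that path if the rules are in effect, and `unspecified` for a path that no rule matches.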
| 1 | # Local Open-Music Drop Zones | 1 | # Raw Open Music Drop Zones |
| 2 | 2 | ||
| 3 | Put real downloaded open-music audio files here before running the one-shot smoke flow. | 3 | ## One-screen summary |
| 4 | 4 | ||
| 5 | ## Recommended folders | 5 | | Dataset | Local directory | Primary use here | Current status | License note | |
| 6 | - `data/raw/fma_small_audio/` | 6 | |---|---|---|---|---| |
| 7 | - `data/raw/mtg_jamendo_audio/` | 7 | | FMA Small | [fma_small_audio/](./fma_small_audio/) | first real-data smoke + training baseline | not downloaded | track-level terms vary; verify before broader use | |
| 8 | | MTG-Jamendo | [mtg_jamendo_audio/](./mtg_jamendo_audio/) | retrieval/eval smoke and corpus compatibility checks | not downloaded | common usage path is research-oriented; do **not** treat as commercial-ready by default | | ||
| 9 | |||
| 10 | ## Recommended order | ||
| 11 | |||
| 12 | ```mermaid | ||
| 13 | flowchart LR | ||
| 14 | A[Download / place local audio] --> B[check-local-ready] | ||
| 15 | B --> C[inspect-local] | ||
| 16 | C --> D[smoke-local] | ||
| 17 | ``` | ||
| 18 | |||
| 19 | ## Minimal commands | ||
| 20 | |||
| 21 | ### 1) Verify the folder is actually usable | ||
| 8 | 22 | ||
| 9 | ## Next command | ||
| 10 | For FMA: | ||
| 11 | ```bash | 23 | ```bash |
| 12 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 | 24 | /usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready fma data/raw/fma_small_audio --eval-ratio 0.2 --query-duration 8.0 |
| 25 | /usr/local/miniconda3/bin/python src/data/external_adapters.py check-local-ready mtg_jamendo data/raw/mtg_jamendo_audio --eval-ratio 0.2 --query-duration 8.0 | ||
| 13 | ``` | 26 | ``` |
| 14 | 27 | ||
| 15 | For MTG-Jamendo: | 28 | ### 2) Run the full smoke pipeline |
| 29 | |||
| 16 | ```bash | 30 | ```bash |
| 31 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local fma data/raw/fma_small_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 | ||
| 17 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 | 32 | /usr/local/miniconda3/bin/python src/data/external_adapters.py smoke-local mtg_jamendo data/raw/mtg_jamendo_audio --output-root data/external_smoke --eval-ratio 0.2 --query-duration 8.0 --train-epochs 1 --batch-size 2 |
| 18 | ``` | 33 | ``` |
| 34 | |||
| 35 | ## Git LFS policy | ||
| 36 | |||
| 37 | Large raw archives and audio under `data/raw/` are tracked through Git LFS via [/.gitattributes](../../../.gitattributes). | ||
| 38 | |||
| 39 | ## Download / ingestion policy | ||
| 40 | |||
| 41 | 1. Prefer **small, verifiable subsets first**. | ||
| 42 | 2. Keep original archives under `data/raw/` only when they are truly needed for reproducibility. | ||
| 43 | 3. For MTG-Jamendo, keep the corpus in an **evaluation/research** lane unless a verified commercial-safe subset is explicitly documented. | ||
| 44 | 4. Record source URL, subset choice, and license evidence in docs before broadening training scope. | ... | ... |
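
The recommended order in this README (readiness check first, smoke run second) can be sketched as a small gate. This is a sketch under stated assumptions, not the project's own tooling: it assumes `check-local-ready` signals failure with a non-zero exit status, which the diff does not confirm. `build_commands` and `run_gated` are hypothetical names, `sys.executable` stands in for the hard-coded interpreter path, and the `inspect-local` step is left out because its arguments are not shown here.

```python
import subprocess
import sys

SCRIPT = "src/data/external_adapters.py"  # path as used in the README commands


def build_commands(dataset, raw_dir, python=sys.executable):
    """Build argv lists for the readiness check and the smoke run.

    Only flags that appear in the README commands are used.
    """
    shared = ["--eval-ratio", "0.2", "--query-duration", "8.0"]
    check = [python, SCRIPT, "check-local-ready", dataset, raw_dir, *shared]
    smoke = [
        python, SCRIPT, "smoke-local", dataset, raw_dir,
        "--output-root", "data/external_smoke",
        *shared,
        "--train-epochs", "1", "--batch-size", "2",
    ]
    return check, smoke


def run_gated(dataset, raw_dir):
    """Run the smoke pipeline only if the readiness check passes.

    Assumption: a non-zero exit status from check-local-ready means "not ready".
    """
    check, smoke = build_commands(dataset, raw_dir)
    if subprocess.run(check).returncode != 0:
        return False
    return subprocess.run(smoke).returncode == 0
```

For example, `run_gated("fma", "data/raw/fma_small_audio")` would skip the training step entirely when the folder is not yet usable.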
| ... | @@ -222,6 +222,31 @@ | ... | @@ -222,6 +222,31 @@ |
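| 222 | - Handoff information is better suited to automation and long-term continuous development | 222 | - Handoff information is better suited to automation and long-term continuous development |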
| 223 | 223 | ||
| 224 | 224 | ||
| 225 | |||
| 226 | ### Stage: Raw open-data LFS governance | ||
| 227 | |||
| 228 | Completed: | ||
| 229 | - Added a root-level [/.gitattributes](../.gitattributes) | ||
| 230 | - Configured a Git LFS tracking policy for large files and audio under `acr-engine/data/raw/` | ||
| 231 | - Rewrote [acr-engine/data/raw/README.md](../acr-engine/data/raw/README.md), adding: | ||
| 232 | - A table of drop-zone responsibilities | ||
| 233 | - The shortest path, `check-local-ready -> smoke-local` | ||
| 234 | - The raw-data download / LFS governance policy | ||
| 235 | - Added download / LFS governance notes to [docs/dataset-sources-and-licensing.md](./dataset-sources-and-licensing.md) | ||
| 236 | |||
| 237 | Verification results: | ||
| 238 | - `git lfs version` succeeded | ||
| 239 | - `git check-attr filter -- acr-engine/data/raw/fma_small_audio/example.wav` returned `filter: lfs` | ||
| 240 | - `git check-attr filter -- acr-engine/data/raw/archive.zip` returned `filter: lfs` | ||
| 241 | - The docs now explicitly distinguish: | ||
| 242 | - Engineering usability | ||
| 243 | - Benchmark suitability | ||
| 244 | - Commercial deployability | ||
| 245 | |||
| 246 | Conclusion: | ||
| 247 | - The repository now has the LFS infrastructure needed to take in real open audio and archives | ||
| 248 | - When real data is downloaded later, large files will not go straight into ordinary git history | ||
| 249 | |||
| 225 | ### Stage: Real-data readiness gate | 250 | ### Stage: Real-data readiness gate |
| 226 | 251 | ||
| 227 | Completed: | 252 | Completed: | ... | ... |
| ... | @@ -96,3 +96,35 @@ flowchart LR | ... | @@ -96,3 +96,35 @@ flowchart LR |
| 96 | 96 | ||
| 97 | ## Sources | 97 | ## Sources |
| 98 | - See [references-and-sources.md](./references-and-sources.md) for the current source map. | 98 | - See [references-and-sources.md](./references-and-sources.md) for the current source map. |
| 99 | |||
| 100 | |||
| 101 | ## Download / LFS governance | ||
| 102 | |||
| 103 | ### Preferred repository behavior | ||
| 104 | |||
| 105 | ```mermaid | ||
| 106 | flowchart TD | ||
| 107 | A[Upstream dataset source] --> B[Local raw drop zone] | ||
| 108 | B --> C[Git LFS tracked large files] | ||
| 109 | C --> D[check-local-ready] | ||
| 110 | D --> E[prepare-local / smoke-local] | ||
| 111 | ``` | ||
| 112 | |||
| 113 | ### Current repo policy | ||
| 114 | |||
| 115 | | Item | Policy | Reason | | ||
| 116 | |---|---|---| | ||
| 117 | `acr-engine/data/raw/**/*.zip` / `.tar` / `.tar.gz` / `.tgz` | Git LFS | avoid bloating normal git history | ||
| 118 | `acr-engine/data/raw/**/*.wav` / `.mp3` / `.flac` / `.ogg` | Git LFS | keep local runs reproducible without committing large audio blobs to normal git history | ||
| 119 | | FMA Small | acceptable as first real-data engineering baseline | easiest realistic open music smoke path | | ||
| 120 | | MTG-Jamendo | default to research/eval lane | do not assume commercial-safe rights without subset-specific proof | | ||
| 121 | |||
| 122 | ### Operational note | ||
| 123 | |||
| 124 | Even when a dataset is technically downloadable, this project should separate: | ||
| 125 | |||
| 126 | - **engineering usability** | ||
| 127 | - **benchmark suitability** | ||
| 128 | - **commercial deployment suitability** | ||
| 129 | |||
| 130 | These are not the same thing. | ... | ... |
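
One way to keep the three questions separate is to record them as independent fields, so that a "yes" on one never implies a "yes" on another. The following is an illustrative Python sketch only. The `Suitability` class, the `allowed_for` helper, and the example values are assumptions made for illustration and do not represent a verified licensing assessment.

```python
from dataclasses import dataclass
from typing import Optional

PURPOSES = ("engineering", "benchmark", "commercial")


@dataclass(frozen=True)
class Suitability:
    """Three independent answers; None means "not yet verified"."""
    engineering: Optional[bool] = None
    benchmark: Optional[bool] = None
    commercial: Optional[bool] = None


def allowed_for(record: Suitability, purpose: str) -> bool:
    """Allow a purpose only when it has been explicitly confirmed."""
    if purpose not in PURPOSES:
        raise ValueError(f"unknown purpose: {purpose!r}")
    return getattr(record, purpose) is True


# Illustrative values: the policy table accepts FMA Small as an engineering
# baseline; everything else is left unverified rather than assumed.
FMA_SMALL = Suitability(engineering=True)
MTG_JAMENDO = Suitability()  # research/eval lane by default, nothing confirmed
```

With this shape, `allowed_for(FMA_SMALL, "commercial")` is `False` until someone records explicit evidence, which matches the rule that downloadable does not mean deployable.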