dataproc4: Notes on Reuse and Ruins
“we built this before Lepton, before R2, before anyone stopped us.”
📍 Why This Exists
You’re trying to understand what dataproc4 solved, and what parts might still be worth using. This document won’t walk you through every node—it’s meant to recap core ideas, point to reusable structures, and mark what didn’t age well.
You can now plug the full codebase into an LLM. I’m not competing with that. Think of this as a map of intentions: what the system meant to do, not just what it does.
Recap: Dataproc4 Architecture
(More detail lives in the Lark doc): Lark Doc
The pipeline breaks down roughly into a few stages; the thinking behind each one is covered further below:
- merge meta: crawler download → store to s3 → merge into a metadata parquet (frozen, *.todo.parquet)
- prefiltering: use the metadata we already have for a first pass that drops unwanted images (eg. 85% artworks on pixiv are amateurs)
- calculate metrics: take the todo parquet as input, compute metrics on SageMaker, store as *.[metric].parquet
- merge metrics: merge all the parts into a single parquet per dataset
- merge datasets: merge all datasets; merge duplicate images that appear across datasets
- apply filtering & assemble prompts: filter the dataset using the metrics above, then assemble each row's prompt
- export dataset: save to an s3 bucket with file versioning enabled (bucket-data-catalog)
```mermaid
flowchart TD
    %% ─────────────────────────────── INGEST & METRICS ────────────────────────────
    UpdateTodos["Update Todos<br/>(inventory diff → todo.parquet)"]
    SageMaker["SageMaker Jobs<br/>(clip scores, softmax, taggers)"]
    Sources --> UpdateTodos --> SageMaker
    %% ──────────────────────────────── PREFILTER ────────────────────────────────
    Prefilter["Prefilter Data<br/>(clip ≥ 1.2, softmax, general rows)"]
    S3Merged --> Prefilter
    %% ──────────────────────────────── DEDUPE ────────────────────────────────────
    DedupIntra["Dedup (intra‑dataset)"]
    DedupCross["Dedup (cross‑dataset)"]
    Augment --> DedupIntra --> DedupCross
    %% ──────────────────────────────── LATENTS ───────────────────────────────────
    Encode["Encode Latent<br/>(Lepton)"]
    Sanity --> Encode --> Training["Training Pipeline"]
    %% ──────────────────────────────── DEMO ─────────────────────────────────────
    Training --> Demo["Gradio / Comfy Demo"]
```
✅ What’s Still Usable
Because the data storage format and the platform have changed (we now use webdataset, plus the H100 cluster), a lot of the old platform-specific work is no longer needed (…e.g. splitting files into tiny pieces so g5.xlarge wouldn't run out of memory);
but the broad design patterns are still worth referring to:
🧭 Namespacing and Per-Source Merging
dataproc4 was mainly used to process anime images, so we could make a few assumptions about the data:
- the same image shows up on different platforms (twitter, danbooru, pixiv)
- not every platform's metadata is equally reliable (gelbooru carries fewer tags than danbooru)
- each platform has its own content bias (danbooru leans nsfw)
So we can:
- assign each image a unique id that stays the same across sources (it can be derived from the image embedding)
- when merging metadata from different sources, use the image id as the primary key, and prefix each source's metadata with a {source}__ namespace to prevent collisions
- merge metadata according to how trustworthy each source is; merge metadata fields (tags, ratings, etc.) probabilistically, with source-weighted logic.
This lets us put the more important information at the front of the prompt and push uncertain information toward the end, which is friendlier both for clip-based models to learn from and for prompting; it also gives us a handle later for controlling the length of the assembled prompt.
Where to look: merge_meta_tag_pipeline.py — check how source_weight_map is passed and applied during merge.
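Here's a minimal sketch of the namespacing + source-weighted merge idea in pandas. The source_weight_map name mirrors what the pipeline passes around, but the frame layout, the comma-separated tags column, and the merge_sources helper are all illustrative, not the actual code in merge_meta_tag_pipeline.py.

```python
import pandas as pd

# Hypothetical weights: higher = more trusted source (illustrative values).
source_weight_map = {"danbooru": 1.0, "pixiv": 0.6, "gelbooru": 0.4}

def merge_sources(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Merge per-source metadata frames keyed by a shared image_id.

    Each source's columns get a '{source}__' prefix so fields never collide;
    a single 'tags' column is then assembled source by source, most trusted
    source first, so important tokens land early in the prompt.
    """
    namespaced = [
        df.set_index("image_id").add_prefix(f"{source}__")
        for source, df in frames.items()
    ]
    merged = pd.concat(namespaced, axis=1, join="outer")

    # Order sources by descending weight and concatenate their tag columns.
    ordered = sorted(source_weight_map, key=source_weight_map.get, reverse=True)
    tag_cols = [f"{s}__tags" for s in ordered if f"{s}__tags" in merged.columns]
    merged["tags"] = merged[tag_cols].apply(
        lambda row: ", ".join(dict.fromkeys(
            t for col in row.dropna() for t in col.split(", ")
        )),
        axis=1,
    )
    return merged.reset_index()
```

The ordering is the point: more trusted sources contribute their tags first, so the important tokens end up at the front of the assembled prompt, and truncating for length later mostly drops the uncertain tail.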
🎯 Deterministic Behaviors
Dataproc4 was built on the Kedro framework, and Kedro puts a lot of weight on code being reproducible / reusable:
- enforced static configs:
  - every variable passed into the pipeline is recorded in yaml files rather than hard-coded (this encourages reusable / inspectable code)
- data as pipeline artifact:
  - development happens at the pipeline level, not step by step
  - data is the result of a pipeline run, so what we maintain is the pipeline code that produces the data, not the data itself.
- deterministic behavior:
  - Kedro wants the data pipeline to be in a fully determined state before it runs:
  - the pipeline (ideally) should not contain anything that can change the pipeline itself (eg. pipeline that generates pipelines)
  - (the opposite design pattern is called runtime dynamism: the pipeline gets more flexible, but traceability goes down)
Where to look: the pipelines themselves.
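For the "enforced static configs" point, a minimal sketch of what this looks like in Kedro; the parameter keys, dataset names, and the prefilter_rows node are hypothetical stand-ins, not dataproc4's real config.

```python
# conf/base/parameters.yml would hold the tunables (hypothetical keys):
#   prefilter:
#     min_clip_score: 1.2

from kedro.pipeline import Pipeline, node

def prefilter_rows(meta, params):
    """Drop rows below the configured clip-score threshold (sketch)."""
    return meta[meta["clip_score"] >= params["min_clip_score"]]

def create_pipeline(**kwargs) -> Pipeline:
    # No literals here: every threshold lives in parameters.yml and every
    # input/output name lives in the catalog, so a run is fully described
    # by the code + config checked into version control.
    return Pipeline([
        node(
            func=prefilter_rows,
            inputs=["merged_meta", "params:prefilter"],
            outputs="prefiltered_meta",
            name="prefilter_rows",
        ),
    ])
```

Changing a threshold means editing yaml and rerunning, which is exactly the determinism Kedro is after: the DAG is known before anything executes.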
📓 Notebook-First, Pipeline-Second, Dataset-Third
The development loop for dataproc4 went roughly like this:
- first pin down the rough data-processing flow with a flowchart, minimizing dependencies between steps (define variables)
- write prototype and experimental code in a notebook (eg. loading data, or doing transforms)
- move functions that have stabilized out of the notebook and into the codebase (extract to kedro code)
- convert whatever remains of the notebook into a pipeline (parameters/catalog/DAG)
This keeps the flexibility of everyday notebook coding while keeping the pipeline itself relatively tidy.
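A rough sketch of what the extraction step looks like, assuming a hypothetical normalize_tags transform that was prototyped in a notebook; the module path and names are made up for illustration.

```python
# src/dataproc/nodes/tags.py  (step 3: the stabilized notebook function)
import pandas as pd

def normalize_tags(meta: pd.DataFrame) -> pd.DataFrame:
    """Same transform as the notebook cell, now importable and testable."""
    meta = meta.copy()
    meta["tags"] = meta["tags"].str.lower().str.strip()
    return meta
```

Step 4 then just wires normalize_tags into a node the same way as the prefilter sketch above, and the notebook cell gets deleted.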
For production datasets:
- after every pipeline change, rerun it to generate a new dataset;
- the dataset is the result of running the pipeline, not the goal. Prefer changing the pipeline over editing the dataset directly; this keeps things reproducible and cuts down the work needed when you come back to audit later.
(Notes on AI use):
- Given that cursor's notebook support isn't great, whether to keep developing in notebooks is a tradeoff. Even so, it's still better to develop in a standalone file first and fold it into the pipeline once it's stable.
- Having AI write boilerplate / pipeline definitions, or edit yaml, is still very useful; just don't point agent mode directly at the pipeline itself, since a small mistake somewhere is hard to debug. Testing module by module and then integrating is more reliable.
- Could also consider …
See also: the merge metrics pipeline came almost fully from a notebook named tagger_quality_sanitycheck_v5.ipynb. That notebook is better than most meetings.
🔁 Checkpointing and Resume Logic
The earlier sections laid out the pipeline stages and the flowchart.
Broadly, each stage is designed to be relatively independent, so when a step fails or gets modified you can pick up directly from the previous stage's cached results.
- Again, we want to maintain the pipeline instead of hand-crafting data itself. This ensures observability and smooth future reruns.
- It's slightly more investment upfront, but planning how data will be processed before doing it saves time in the long run, since we can often find reusable steps and shortcuts.
Checkpointing can be implemented in several ways:
- dataproc4 used a versioned s3 bucket to store the artifacts between stages. Upside: fast and stable. Downside: no preview, and every time you need something you have to dig up the s3 path and load it.
- a huggingface dataset is also a decent option; slightly slower, but you can upload parquet directly with HfApi. Bonus: huggingface datasets come with a Dataset Viewer, which is handy while you're processing.
- keeping a pile of parquet files locally is also a form of checkpointing, but once there are many of them it's hard to remember what each one is for, and sharing with other people (or reusing data they've processed) gets harder. For data you're actively working on, keeping extra local copies still helps.
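A sketch of the resume pattern, assuming one parquet artifact per stage; the cache layout and the run_stage helper are illustrative, whichever backend (versioned s3, huggingface, local disk) actually holds the files.

```python
from pathlib import Path
import pandas as pd

CACHE_ROOT = Path("cache/stages")  # stand-in for the s3 / hf / local backend

def run_stage(name, fn, *inputs):
    """Run a stage only if its artifact is missing; otherwise reuse the cache.

    Because each stage writes exactly one parquet, a failed or modified
    stage can restart from the previous stage's output instead of from zero.
    """
    artifact = CACHE_ROOT / f"{name}.parquet"
    if artifact.exists():
        return pd.read_parquet(artifact)
    out = fn(*inputs)
    CACHE_ROOT.mkdir(parents=True, exist_ok=True)
    out.to_parquet(artifact)
    return out

# Later stages transparently resume from whatever is already cached:
# meta   = run_stage("merge_meta", merge_meta, raw_parts)
# scored = run_stage("merge_metrics", merge_metrics, meta)
```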
🔀 DAG / Pipeline Approach to Data
As stated before, we see data as the artifact and iterate on the pipeline.
We want to reduce the manual loop-back steps, and make sure that data stages do not overwrite each other.
Say, for example, we're adding filters to the cleaning step and then find out we need an extra metric to do it well. We do:
- explore, produce experimental changes, and validate them in a notebook
- create reusable functions as nodes
- chain nodes to create pipelines.
We do not:
- make manual changes, then overwrite a dataset in the catalog. (instead, rerun the modified pipeline)
- create datasets in notebooks, then use the data as-is. (larger training datasets should be created from pipelines, or at least mostly from them)
Data gradually gets stripped, aggregated, refined, validated, following the flow, until it turns into training data.
Another benefit of working this way is that it makes collaboration easier:
- when the pipeline has clear steps / checkpoints, different people can be assigned to build on the same foundation.
- compared with one person writing 20 notebooks that gradually append to a single parquet, it's also a lighter load on each individual.
❌ What Didn’t Age Well
When dataproc4 was written, Lepton and the H100 cluster didn't exist yet; some of the hacks built around AWS quirks may no longer apply:
🐌 Memory-Conscious Chunking
Originally designed for g5.xlarge (16GB RAM, spot instance volatility). We:
- Chunked work into RAM-safe slices
- Retried failures on resume
- Tuned everything to get under 85% memory ceiling
AWS was a high-bandwidth, low-compute platform for us, so we leaned on S3 concurrency for efficiency.
Now we're on 8xH100 nodes in a low-bandwidth, high-compute environment. Chunking metadata into smaller parts isn't necessary anymore, since we can just load everything into 2TB of RAM.
The chunk logic is elegant, but introduces overhead. That said—some of it is hard to reinvent cleanly, especially the retry/resume mechanics.
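For reference, the retry/resume mechanics in rough form; the chunk size, file naming, and compute_metric callable are illustrative, not the original implementation.

```python
from pathlib import Path
import pyarrow.parquet as pq

OUT_DIR = Path("work/metric_parts")  # illustrative layout
CHUNK_ROWS = 200_000                 # sized so one slice stays well under the RAM ceiling

def process_in_chunks(todo_path, compute_metric):
    """Stream the todo parquet in RAM-safe slices, skipping finished parts.

    A spot interruption just means rerunning the script: completed part
    files are detected and skipped, so the job resumes where it stopped.
    """
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    parquet = pq.ParquetFile(todo_path)
    for i, batch in enumerate(parquet.iter_batches(batch_size=CHUNK_ROWS)):
        part_file = OUT_DIR / f"part-{i:05d}.parquet"
        if part_file.exists():
            continue  # finished on an earlier (possibly interrupted) run
        compute_metric(batch.to_pandas()).to_parquet(part_file)
```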
📦 WebDataset Retrofitting (Which We… Didn’t)
We shifted from per-image processing to WebDataset tarballs. But:
- Dataproc4 expects file paths for the metrics calculation step
- Metadata loaders weren’t tar-aware
- Validation assumed directory lifecycles
In short: WebDataset was a structural refactor that happened late, and I was in waist-high water during that change. The gather meta part might need changes (but the merge meta part might be salvageable).
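For what "tar-aware" would mean: the metric stage would have to iterate shards with the webdataset library rather than open per-image paths. A sketch, with an illustrative shard pattern and sample keys:

```python
import webdataset as wds

# Illustrative shard pattern; for s3 this would typically be wrapped in a
# "pipe:aws s3 cp ... -" url instead of a local path.
shards = "shards/train-{000000..000099}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")                            # decode image bytes into PIL images
    .to_tuple("__key__", "jpg;png", "json")   # sample key, image, metadata dict
)

for key, image, meta in dataset:
    # feed (key, image) to the metric model here instead of open(path)
    ...
```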
🔍 What’s Underneath (But Not Fully Explained Yet)
This document focused on reusability. It leaves out:
- The full Kedro layout (how we balanced modularity + performance)
- How we hacked dynamic, context-aware arguments while pretending everything was declarative
- How training pipelines integrate with dataset inspection
- Our abuse of partial functions and pipeline templates to do runtime DAG construction
- And why the whole thing mostly didn’t catch fire
These topics belong to something longer. Probably called “Episode 2.”
🏁 Final Notes
dataproc4 worked. It scaled. It held together across five contributors and hundreds of millions of samples. It thrived in a hostile environment, kept traceability, enabled cooperation, and didn't collapse under scope creep.
Some parts were infra hacks that you no longer need. But others—naming discipline, approaches to pipelining, metadata merging, modular but inspectable logic—are worth keeping alive.
Reuse what pushes back against entropy. Let the rest rot in peace.
If you decide to go in deeper, I’ll be here for Episode 2.