Fine-Tuning Research · 3 Experiments
Fine-tuning a 32B language model to generate mythology-inspired Midjourney prompts, analyze cinematography, and direct visual storytelling. Three runs across Azure AI Foundry and GCP Vertex AI — each building on the last.
# Qwen3-32B Creative Director — 3 fine-tuning runs
$ vertex-ai train // Run 0
step 100 | eval 1.399 | acc 66.7%
step 295 | eval 1.185 | acc 70.3%
[DONE] 295 steps · ↓44.7% from baseline
$ azure-ai train --epochs 1 // Run 1
step 782 | loss 1.126 | eval 1.195
[DONE] 50min · 0.99 PF
$ azure-ai train --epochs 2 // Run 2
step 782 | loss 1.126 | eval 1.195
step 1564 | loss 1.195 | eval 1.017
[DONE] 99min · 1.98 PF · best eval ↓14.8%
Run 0
GCP Vertex AI
Steps
295
~1 epoch
Eval Loss
1.185
70.3% token acc.
295 steps · 44.7% reduction from start
Run 1
Azure AI Foundry
Steps
782
1 epoch
Eval Loss
1.195
1 epoch · baseline
Run 2
Azure AI Foundry
Steps
1,564
2 epochs
Eval Loss
1.017
2 epochs · best loss, −14.8% vs Run 1 eval
Two platforms, three runs. Run 0 on GCP Vertex AI. Runs 1 & 2 on Azure AI Foundry with Qwen3-32B.
Run 0 hyperparameters not exported by Vertex AI — platform manages LR schedule and batch size internally.
15,434 mythology-focused training pairs built from Midjourney session data. Used across all three experiments — format varies by platform.
Raw Pairs
18,431
Before deduplication
Duplicates Removed
2,997
16.3% of raw data
Train Split
13,890
90% of unique pairs
Validation Split
1,544
10% held out
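The dedup-then-split numbers above (18,431 raw → 15,434 unique → 90/10 split) can be sketched as a small pipeline; the `prompt`/`response` field names are illustrative, not the dataset's actual schema:

```python
import hashlib
import random

def dedup_and_split(pairs, val_frac=0.10, seed=42):
    """Drop exact duplicates, then hold out a validation fraction.

    Duplicates are detected by hashing the (prompt, response) text.
    Field names are assumptions about the pair schema.
    """
    seen, unique = set(), []
    for p in pairs:
        key = hashlib.sha256((p["prompt"] + "\x1f" + p["response"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    rng = random.Random(seed)          # fixed seed -> reproducible split
    rng.shuffle(unique)
    n_val = int(len(unique) * val_frac)
    return unique[n_val:], unique[:n_val]  # train, val

# 120 raw pairs with 20 exact duplicates -> 100 unique -> 90/10 split
pairs = [{"prompt": f"p{i % 100}", "response": f"r{i % 100}"} for i in range(120)]
train, val = dedup_and_split(pairs)
print(len(train), len(val))  # 90 10
```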
31.1%
Liked (is_liked)
3,293 pairs
4.2%
Favorites (rating=5)
444 pairs
31.1%
Rated 4+
3,293 pairs
63.3%
Unrated (rating=0)
6,699 pairs
5.5%
Disliked (rating=1)
585 pairs
Eval loss and accuracy across all three runs. Run 2 holds the best eval loss. Run 0 is the first to track token-level accuracy.
The first experiment — GCP Vertex AI, 295 steps. A different platform with internally managed hyperparameters, and the only run to track token-level prediction accuracy. 44.7% eval loss reduction from starting baseline.
Starting Eval Loss
2.144
Step 1 baseline
Final Eval Loss
1.185
Step 295 — still improving
Eval Improvement
44.7%
Step 1 → step 295
Token Accuracy
70.3%
Fraction of correct next-token predictions
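Token accuracy as Vertex AI reports it is presumably the fraction of positions where the argmax next-token prediction matches the label; a minimal sketch (plain lists, with the common convention of masking padding positions as -100):

```python
def token_accuracy(pred_ids, label_ids, ignore_id=-100):
    """Fraction of positions where the predicted token id matches the
    label, skipping masked/padding positions (label == ignore_id)."""
    correct = total = 0
    for p, y in zip(pred_ids, label_ids):
        if y == ignore_id:
            continue  # padding / prompt tokens excluded from the metric
        total += 1
        correct += int(p == y)
    return correct / total if total else 0.0

# 2 of 3 scored positions correct; the last position is masked out
print(token_accuracy([5, 9, 2, 7], [5, 9, 3, -100]))
```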
Single-epoch baseline on Azure AI Foundry — 782 steps, constant LR 1e-4. The model rapidly acquires mythology vocabulary and Midjourney parameter structure. Establishes the foundation for Run 2.
Starting Loss
2.564
Step 1, before any training
Train End Loss
1.126
Step 782, final train step
Eval Loss
1.195
Epoch 1 checkpoint
Loss Reduction
56.5%
From 2.564 to 1.195
Extended to 2 epochs on the same dataset and hyperparameters. Epoch 2 improved eval loss by 14.8% over epoch 1 — the model consolidated style patterns rather than memorizing. Best overall eval loss across all three runs.
Starting Loss
2.564
Shared baseline with Run 1
Epoch 1 Eval
1.195
Step 782 checkpoint
Epoch 2 Eval
1.017
Step 1,564 — best across all runs
Epoch Improvement
14.8%
Epoch 1 → Epoch 2 eval
Patterns that held across all three runs, and the persistent issues that need to be resolved before Run 3.
56.5% loss reduction. The model rapidly acquired domain vocabulary — mythology characters, cinematography terms, Midjourney parameters.
Despite only 8.2% train loss reduction, eval loss improved 14.8%. Epoch 2 consolidated patterns rather than memorizing.
Both runs produced nearly identical metrics through step 782, so training on Azure AI Foundry appears deterministic for a fixed seed and data order.
8.24–8.43 samples/s is reasonable for a 32B parameter model, indicating efficient batch utilization.
LR jumps from 0 to 1e-4 at step 2. Initial gradient norm (2.125) is high. A 10% warmup (~156 steps) would smooth early training.
A flat 1e-4 applies the same learning rate throughout. Cosine decay would stabilize later training and address the plateau at steps 1100–1200.
Average grad norm increased 49% (1.592 → 2.379). 37 spikes >3.0 suggest unnecessarily large updates. Gradient clipping (max_norm=1.0) recommended.
Train loss (1.195) > eval loss (1.017) is atypical. Possible: val set is easier, train includes noisier examples (63.3% unrated), or insufficient eval checkpoints.
Only end-of-epoch evaluation. Cannot determine where optimal checkpoint lies within each epoch. Best practice: every 100–200 steps.
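The schedule fixes above (10% linear warmup, cosine decay, evals every 100–200 steps) can be sketched as a schedule function; the step counts mirror Run 2's 1,564 steps, and whether Azure AI Foundry's managed fine-tuning exposes these as job hyperparameters is an open question. Gradient clipping would be the separate `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` call inside the training step.

```python
import math

TOTAL_STEPS = 1564                      # Run 2 length
WARMUP = int(0.10 * TOTAL_STEPS)        # ~156 steps, per the 10% recommendation
PEAK_LR = 1e-4

def lr_at(step):
    """Linear warmup from ~0 to PEAK_LR, then cosine decay to 0."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

# evaluate every 150 steps instead of only at epoch boundaries
eval_steps = [s for s in range(TOTAL_STEPS) if s > 0 and s % 150 == 0]
print(f"{lr_at(0):.2e} {lr_at(WARMUP):.2e} {lr_at(TOTAL_STEPS - 1):.2e}")
```

This removes the step-2 jump to 1e-4, shrinks the step size over the 1100–1200 plateau region, and gives ten intra-epoch checkpoints to locate the optimum.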
Most examples have no quality signal. Including disliked prompts (5.5%) may teach the model to produce suboptimal outputs.
Only 291/15,434 pairs (1.9%) are video-related. If video analysis is a key use case, target 5–10% representation.
No multi-turn examples. The model won't learn iterative refinement or follow-up feedback — critical for a creative director role.
All examples use the same 1,175-char system prompt. Creates tight coupling — model may underperform with any variation.
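One way to loosen that coupling is rotating paraphrased system prompts during dataset construction; the variants below are hypothetical stand-ins, not the actual 1,175-char prompt:

```python
import random

# Hypothetical paraphrases of the single fixed system prompt;
# sampling among them teaches the model to tolerate variation.
SYSTEM_VARIANTS = [
    "You are a creative director specializing in mythology-inspired Midjourney prompts.",
    "Act as a visual-storytelling director; craft Midjourney prompts rooted in mythology.",
    "You generate cinematic, mythology-themed Midjourney prompts and critique imagery.",
]

def with_varied_system(example, rng):
    """Return a copy of a training example with a randomly chosen system prompt."""
    out = dict(example)
    out["system"] = rng.choice(SYSTEM_VARIANTS)
    return out

rng = random.Random(0)
ex = with_varied_system({"prompt": "Athena at dawn", "response": "..."}, rng)
print(ex["system"] in SYSTEM_VARIANTS)  # True
```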
Addressing the training instability, data quality gaps, and evaluation gaps identified across Runs 0–2.
Held-out test set — Curated 100–200 examples covering all task types — not in training or validation
Task-specific eval — Track loss separately for prompt gen, image analysis, video replication, and refinement tasks
Human eval baseline — Score 50 outputs on accuracy, mythology depth, style, and production-readiness
A/B test vs base — Compare fine-tuned Qwen3-32B vs base model vs Gemini on identical prompts
Three blockers remain before Run 3 can start. Data pipeline and format conversion are the critical path.
21,318 images + 8,586 videos in GCS
15,434 pairs (Gemini format)
817/817 transferred to GCS
Run build_multimodal_dataset.py
Write convert_to_azure.py
Filter to liked/rated only
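The two scripted blockers above, the Gemini-to-Azure conversion and the liked/rated filter, could look like the sketch below. The Gemini and Azure record shapes and the `is_liked`/`rating` fields are assumptions about the dataset schema, not the contents of the actual `convert_to_azure.py`:

```python
import json

def convert_record(rec, system_prompt):
    """Map a Gemini-style pair to the OpenAI/Azure chat-messages format.

    Assumed shapes: Gemini records carry `contents` with user/model turns;
    Azure fine-tuning expects a `messages` list. Adjust to the real schema.
    """
    turns = {t["role"]: t["parts"][0]["text"] for t in rec["contents"]}
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": turns["user"]},
        {"role": "assistant", "content": turns["model"]},
    ]}

def keep(rec):
    """Filter to liked or rated-4+ pairs, dropping unrated/disliked ones."""
    return bool(rec.get("is_liked")) or rec.get("rating", 0) >= 4

gemini_rec = {
    "contents": [
        {"role": "user", "parts": [{"text": "Describe Anubis in golden hour light"}]},
        {"role": "model", "parts": [{"text": "anubis, golden hour, cinematic --ar 16:9"}]},
    ],
    "is_liked": True, "rating": 0,
}
if keep(gemini_rec):
    print(json.dumps(convert_record(gemini_rec, "You are a creative director."))[:40])
```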