Fine-Tuning Research · 3 Experiments

Qwen3-32B
Creative Director

Fine-tuning a 32B language model to generate mythology-inspired Midjourney prompts, analyze cinematography, and direct visual storytelling. Three runs across Azure AI Foundry and GCP Vertex AI — each building on the last.

training-history.log

# Qwen3-32B Creative Director — 3 fine-tuning runs

$ vertex-ai train                   // Run 0

  step 100 | eval 1.399 | acc 66.7%

  step 295 | eval 1.185 | acc 70.3%

  [DONE] 295 steps · ↓44.7% from baseline

$ azure-ai train --epochs 1          // Run 1

  step 782 | loss 1.126 | eval 1.195

  [DONE] 50min · 0.99 PF

$ azure-ai train --epochs 2          // Run 2

  step 782 | loss 1.126 | eval 1.195

  step 1564 | loss 1.195 | eval 1.017

  [DONE] 99min · 1.98 PF · best eval ↓14.8%

Run 0

GCP Vertex AI

Steps

295

~1 epoch

Eval Loss

1.185

70.3% token acc.

295 steps · 44.7% reduction from start

Run 1

Azure AI Foundry

Steps

782

1 epoch

Eval Loss

1.195

1 epoch · baseline

Run 2

Azure AI Foundry

Best Loss

Steps

1,564

2 epochs

Eval Loss

1.017

2 epochs · best loss, −14.8% vs Run 1 eval

01

Experiment Setup

Two platforms, three runs: Run 0 on GCP Vertex AI; Runs 1 and 2 on Azure AI Foundry. All three runs fine-tune Qwen3-32B.

Runs 1 & 2

Azure AI Foundry

Base model: Qwen3-32B
Platform: Azure AI Foundry
Format: OpenAI chat completions
Rate limits: 50K TPM / 50 RPM
Run 0

GCP Vertex AI

Platform: GCP Vertex AI
Experiment ID: tuning-experiment-20260210130258433553
Total steps: 295
Eval checkpoints: steps 1, 100, 200, 295

Training Hyperparameters — All Runs

| Parameter | Run 0 | Run 1 | Run 2 |
| --- | --- | --- | --- |
| Epochs | ~1 | 1 | 2 |
| Learning rate | — | 1e-4 (constant) | 1e-4 (constant) |
| LR schedule | — | None (flat) | None (flat) |
| Warmup | — | 0 steps | 0 steps |
| Total steps | 295 | 782 | 1,564 |
| Runtime | — | ~50 min | ~99 min |
| Throughput | — | 8.24 samples/s | 8.43 samples/s |
| Total FLOPs | — | 0.99 PF | 1.98 PF |

Run 0 hyperparameters were not exported by Vertex AI — the platform manages the LR schedule and batch size internally.

02

Dataset

15,434 mythology-focused training pairs built from Midjourney session data. Used across all three experiments — format varies by platform.

Raw Pairs

18,431

Before deduplication

Duplicates Removed

2,997

16.3% of raw data

Train Split

13,890

90% of unique pairs

Validation Split

1,544

10% held out
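The dedup-and-split step above can be sketched as follows. This is a minimal sketch, not the actual pipeline: the `user`/`response` field names are assumptions, and exact-match hashing is only one reasonable dedup criterion.

```python
import hashlib
import random

def dedupe_and_split(pairs, val_frac=0.10, seed=42):
    """Drop exact-duplicate pairs by content hash, then hold out a validation split."""
    seen, unique = set(), []
    for pair in pairs:
        # Hash user + response text so identical pairs collapse to one entry.
        key = hashlib.sha256(
            (pair["user"] + "\x00" + pair["response"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    # Shuffle deterministically, then carve off the held-out fraction.
    random.Random(seed).shuffle(unique)
    n_val = round(len(unique) * val_frac)
    return unique[n_val:], unique[:n_val]  # (train, validation)
```

With a 10% holdout, the same arithmetic applied to the 15,434 unique pairs yields the 13,890 / 1,544 split reported above.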

Token Statistics per Example

| Component | Avg | Min | Median | Max |
| --- | --- | --- | --- | --- |
| System prompt | 294 | 294 | 294 | 294 |
| User message | 38 | 5 | 37 | 3,690 |
| Model response | 161 | 12 | 135 | 2,422 |
| Total | 494 | 314 | 466 | 4,012 |

Quality Signals

31.1%

Liked (is_liked)

3,293 pairs

4.2%

Favorites (rating=5)

444 pairs

31.1%

Rated 4+

3,293 pairs

63.3%

Unrated (rating=0)

6,699 pairs

5.5%

Disliked (rating=1)

585 pairs

03

Results Summary

Eval loss and accuracy across all three runs. Run 2 holds the best eval loss; Run 0 is the only run to track token-level accuracy.

| Run | Platform | Steps / Epochs | Best Eval Loss | Note |
| --- | --- | --- | --- | --- |
| Run 0 | GCP Vertex AI | 295 / ~1 | 1.185 (70.3% acc) | 44.7% reduction from start |
| Run 1 | Azure AI Foundry | 782 / 1 | 1.195 | 1 epoch · baseline |
| Run 2 | Azure AI Foundry | 1,564 / 2 | 1.017 (lowest) | 2 epochs · −14.8% vs Run 1 eval |
04

GCP Vertex AI

Run 0

The first experiment — GCP Vertex AI, 295 steps. A different platform with internally managed hyperparameters, and the only run to track token-level prediction accuracy. 44.7% eval loss reduction from starting baseline.

Starting Eval Loss

2.144

Step 1 baseline

Final Eval Loss

1.185

Step 295 — still improving

Eval Improvement

44.7%

Step 1 → step 295

Token Accuracy

70.3%

Fraction correct next-step preds
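Token accuracy here is the fraction of positions where the model's top-scoring next token matches the reference token. Vertex AI does not expose its exact metric definition, so the greedy-argmax formulation below is an assumption; a minimal sketch:

```python
def token_accuracy(logits, targets):
    """Fraction of positions where the argmax prediction equals the target token.

    logits: per-position lists of vocabulary scores; targets: reference token ids.
    """
    correct = sum(
        1
        for scores, tgt in zip(logits, targets)
        # argmax over the vocabulary scores at this position
        if max(range(len(scores)), key=scores.__getitem__) == tgt
    )
    return correct / len(targets)
```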

Eval Checkpoints

| Step | Eval Loss | Reduction from Start | Token Accuracy |
| --- | --- | --- | --- |
| 1 | 2.1441 | — | 55.9% |
| 100 | 1.3993 | ↓34.7% | 66.7% |
| 200 | 1.2703 | ↓40.8% | 68.8% |
| 295 | 1.1853 | ↓44.7% | 70.3% |

Convergence Phases

| Steps | Loss | Phase |
| --- | --- | --- |
| 1–50 | 2.364 → 1.585 | Rapid initial descent, domain vocabulary acquisition |
| 51–100 | 1.585 → 1.486 | Plateau with steady refinement |
| 101–190 | 1.486 → 1.207 | Continued learning, style consolidation |
| 191–276 | 1.207 → 1.114 | Best performance, fine-grained pattern capture |
| 277–295 | 1.114 → 1.245 | Slight uptick at end of run |
05

Azure AI Foundry

Run 1

Single-epoch baseline on Azure AI Foundry — 782 steps, constant LR 1e-4. The model rapidly acquires mythology vocabulary and Midjourney parameter structure. Establishes the foundation for Run 2.

Starting Loss

2.564

Step 1, before any training

Train End Loss

1.126

Step 782, final train step

Eval Loss

1.195

Epoch 1 checkpoint

Loss Reduction

56.5%

From 2.564 to 1.195

06

Best Eval Loss

Run 2

Extended to 2 epochs on the same dataset and hyperparameters. Epoch 2 improved eval loss by 14.8% over epoch 1 — the model consolidated style patterns rather than memorizing. Best overall eval loss across all three runs.

Starting Loss

2.564

Shared baseline with Run 1

Epoch 1 Eval

1.195

Step 782 checkpoint

Epoch 2 Eval

1.017

Step 1,564 — best across all runs

Epoch Improvement

14.8%

Epoch 1 → Epoch 2 eval

Gradient Norm — Epoch 1

Average: 1.592
Min: 0.585
Max: 7.807

Gradient Norm — Epoch 2 +49%

Average: 2.379
Min: 1.203
Max: 6.214
Spikes > 3.0: 37 (4.7%)

100-Step Convergence Phases

| Steps | Loss | Phase |
| --- | --- | --- |
| 1–100 | 2.564 → 1.520 | Rapid domain vocabulary acquisition |
| 101–200 | 1.520 → 1.397 | Style and structure learning |
| 201–300 | 1.397 → 1.422 | Pattern consolidation |
| 301–400 | 1.422 → 1.376 | Plateau with micro-gains |
| 401–500 | 1.376 → 1.461 | Fluctuation — data variety impact |
| 501–600 | 1.461 → 1.343 | Continued refinement |
| 601–700 | 1.343 → 1.299 | Diminishing returns |
| 701–782 | 1.299 → 1.126 | Late surge before eval |
07

Analysis

Patterns that held across all three runs, and the persistent issues that need to be resolved before Run 3.

What Worked

Strong Epoch 1 Learning

56.5% loss reduction. The model rapidly acquired domain vocabulary — mythology characters, cinematography terms, Midjourney parameters.

Epoch 2 Generalization

Despite only 8.2% train loss reduction, eval loss improved 14.8%. Epoch 2 consolidated patterns rather than memorizing.

Reproducibility

Both runs produced nearly identical metrics through step 782, suggesting training on Azure AI Foundry is effectively deterministic.

Efficient Throughput

8.24–8.43 samples/s is reasonable for a 32B parameter model, indicating efficient batch utilization.

Concerns

training

No Warmup Phase

LR jumps from 0 to 1e-4 at step 2. Initial gradient norm (2.125) is high. A 10% warmup (~156 steps) would smooth early training.

training

Constant Learning Rate

Flat 1e-4 means equal step sizes throughout. Cosine decay would stabilize later training and address the plateau at steps 1100–1200.
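The two schedule fixes above (warmup, then cosine decay) can be sketched as one pure-Python function. The defaults below take the step counts and peak LR from the Run 3 recommendations, not from an actual config; this is a sketch of the shape, not the exact scheduler any framework would produce.

```python
import math

def lr_at(step, total_steps=1564, warmup_steps=156, peak_lr=5e-5):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly so the first update is small instead of a full-size jump.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop the same shape can be attached via `torch.optim.lr_scheduler.LambdaLR`, or with `transformers.get_cosine_schedule_with_warmup`.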

training

Gradient Instability (Epoch 2)

Average grad norm increased 49% (1.592 → 2.379). 37 spikes >3.0 suggest unnecessarily large updates. Gradient clipping (max_norm=1.0) recommended.
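Norm-based clipping rescales the whole gradient vector whenever its L2 norm exceeds `max_norm`, which is the semantics of PyTorch's `clip_grad_norm_`. A minimal pure-Python sketch of that operation, assuming the gradient is flattened to a list of floats:

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """Scale the gradient vector down so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Uniform rescale preserves the gradient direction.
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```

In an actual run this is a single call to `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` before `optimizer.step()`.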

training

Train-Eval Gap Inversion

Train loss (1.195) > eval loss (1.017) is atypical. Possible causes: an easier validation set, noisier training examples (63.3% unrated), or too few eval checkpoints to see the true train curve.

training

Eval Frequency Too Low

Only end-of-epoch evaluation. Cannot determine where optimal checkpoint lies within each epoch. Best practice: every 100–200 steps.
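The fix is to evaluate on a fixed cadence and checkpoint whenever eval loss improves. A minimal loop skeleton, where `evaluate` and `save` are hypothetical stand-ins for the real eval and checkpointing routines:

```python
def train_with_periodic_eval(steps, evaluate, save, eval_every=200):
    """Evaluate every `eval_every` steps; keep the checkpoint with the best eval loss.

    `steps` yields per-step training state; `evaluate` returns an eval loss for a
    state; `save` persists a checkpoint. Both are caller-supplied stand-ins.
    """
    best_loss, best_step = float("inf"), None
    for step, state in enumerate(steps, start=1):
        if step % eval_every == 0:
            loss = evaluate(state)
            if loss < best_loss:
                # New best: remember it and checkpoint immediately.
                best_loss, best_step = loss, step
                save(state, step)
    return best_step, best_loss
```

This makes the "Checkpoint: best eval loss" recommendation in §08 a side effect of the eval cadence rather than a separate mechanism.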

data

63.3% Unrated Training Data

Most examples have no quality signal. Including disliked prompts (5.5%) may teach the model to produce suboptimal outputs.
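The quality filter proposed for Run 3 (keep liked or well-rated pairs, drop unrated and disliked) is a one-liner over the pair records. The `is_liked`/`rating` field names come from the quality-signal stats above; the exact record schema is an assumption.

```python
def quality_filter(pairs, min_rating=4):
    """Keep pairs that are explicitly liked or rated at/above min_rating.

    Drops unrated (rating == 0) and disliked (rating == 1) examples.
    """
    return [
        p for p in pairs
        if p.get("is_liked") or p.get("rating", 0) >= min_rating
    ]
```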

data

Video Underrepresentation

Only 291/15,434 pairs (1.9%) are video-related. If video analysis is a key use case, target 5–10% representation.

data

Single-Turn Only

No multi-turn examples. The model won't learn iterative refinement or follow-up feedback — critical for a creative director role.

data

Uniform System Prompt

All examples use the same 1,175-char system prompt. Creates tight coupling — model may underperform with any variation.

08

Run 3 Recommendations

Addressing the training instability, data quality gaps, and evaluation gaps identified across Runs 0–2.

| Parameter | Current (Runs 1 & 2) | Recommended | Rationale |
| --- | --- | --- | --- |
| Epochs | 2 | 3–4 | Eval loss still improving; extend to find plateau |
| Learning rate | 1e-4 | 5e-5 | Lower peak to reduce gradient instability |
| LR schedule | Constant | Cosine decay | Aggressive early, stable later |
| Warmup | 0% | 10% (~156 steps) | Smooth initial training |
| Eval frequency | Per epoch | Every 200 steps | Find optimal checkpoint |
| Checkpoint | Per epoch | Best eval loss | Save the actual best model |
| Grad clipping | Default | max_norm=1.0 | Clip the 4.7% of steps with grad spikes |
| Change | Impact | Priority |
| --- | --- | --- |
| Quality filter: liked/rated only | 15,434 → ~5,200 pairs; removes 63% unrated + 5.5% disliked | HIGH |
| Build multimodal dataset | +500–800 image+analysis pairs from 817 GCS images | HIGH |
| Add multi-turn examples | Teaches iterative refinement (critical for creative director) | MEDIUM |
| Augment video data | Currently 1.9% → target 5–10% of training data | MEDIUM |
| Vary system prompts | 3–5 variants to reduce overfitting to exact wording | LOW |
01

Held-out test set: curate 100–200 examples covering all task types — not in training or validation

02

Task-specific eval: track loss separately for prompt gen, image analysis, video replication, and refinement tasks

03

Human eval baseline: score 50 outputs on accuracy, mythology depth, style, and production-readiness

04

A/B test vs base: compare fine-tuned Qwen3-32B vs base model vs Gemini on identical prompts

09

Infrastructure

Three blockers remain before Run 3 can start. Data pipeline and format conversion are the critical path.

Raw Midjourney Data

21,318 images + 8,586 videos in GCS

ready

Training Pairs

15,434 pairs (Gemini format)

ready

Image Manifest

817/817 transferred to GCS

ready

Multimodal Dataset

Run build_multimodal_dataset.py

pending

Azure Format Conversion

Write convert_to_azure.py

blocked

Quality-Filtered Dataset

Filter to liked/rated only

pending