Fine-Tuning Research · 3 Experiments
Fine-tuning a 32B language model to generate mythology-inspired Midjourney prompts, analyze cinematography, and direct visual storytelling. Three runs across Azure AI Foundry and GCP Vertex AI — each building on the last.
# Qwen3-32B Creative Director — 3 fine-tuning runs
$ vertex-ai train // Run 0
step 100 | eval 1.399 | acc 66.7%
step 295 | eval 1.185 | acc 70.3%
[DONE] 295 steps · ↓44.7% from baseline
$ azure-ai train --epochs 1 // Run 1
step 782 | loss 1.126 | eval 1.195
[DONE] 50min · 0.99 PF
$ azure-ai train --epochs 2 // Run 2
step 782 | loss 1.126 | eval 1.195
step 1564 | loss 1.195 | eval 1.017
[DONE] 99min · 1.98 PF · best eval ↓14.8%
Run 0
GCP Vertex AI
Steps
295
~1 epoch
Eval Loss
1.185
70.3% token acc.
295 steps · 44.7% reduction from start
Run 1
Azure AI Foundry
Steps
782
1 epoch
Eval Loss
1.195
1 epoch · baseline
Run 2
Azure AI Foundry
Steps
1,564
2 epochs
Eval Loss
1.017
2 epochs · best loss, −14.8% vs Run 1 eval
Two platforms, three runs. Run 0 on GCP Vertex AI. Runs 1 & 2 on Azure AI Foundry with Qwen3-32B.
Run 0 hyperparameters not exported by Vertex AI — platform manages LR schedule and batch size internally.
15,434 mythology-focused training pairs built from Midjourney session data. Used across all three experiments — format varies by platform.
Raw Pairs
18,431
Before deduplication
Duplicates Removed
2,997
16.3% of raw data
Train Split
13,890
90% of unique pairs
Validation Split
1,544
10% held out
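The dedup-then-split numbers above (18,431 raw → 15,434 unique → 90/10 split) can be sketched as a small pipeline; the `prompt`/`response` field names are illustrative, not the dataset's actual schema:

```python
import hashlib
import random

def dedup_and_split(pairs, val_frac=0.10, seed=42):
    """Drop exact duplicates, then hold out a validation fraction.

    Duplicates are detected by hashing the (prompt, response) text.
    Field names are assumptions about the pair schema.
    """
    seen, unique = set(), []
    for p in pairs:
        key = hashlib.sha256((p["prompt"] + "\x1f" + p["response"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    rng = random.Random(seed)          # fixed seed -> reproducible split
    rng.shuffle(unique)
    n_val = int(len(unique) * val_frac)
    return unique[n_val:], unique[:n_val]  # train, val

# 120 raw pairs with 20 exact duplicates -> 100 unique -> 90/10 split
pairs = [{"prompt": f"p{i % 100}", "response": f"r{i % 100}"} for i in range(120)]
train, val = dedup_and_split(pairs)
print(len(train), len(val))  # 90 10
```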
31.1%
Liked (is_liked)
3,293 pairs
4.2%
Favorites (rating=5)
444 pairs
31.1%
Rated 4+
3,293 pairs
63.3%
Unrated (rating=0)
6,699 pairs
5.5%
Disliked (rating=1)
585 pairs
Eval loss and accuracy across all three runs. Run 2 holds the best eval loss. Run 0 is the first to track token-level accuracy.
The first experiment — GCP Vertex AI, 295 steps. A different platform with internally managed hyperparameters, and the only run to track token-level prediction accuracy. 44.7% eval loss reduction from starting baseline.
Starting Eval Loss
2.144
Step 1 baseline
Final Eval Loss
1.185
Step 295 — still improving
Eval Improvement
44.7%
Step 1 → step 295
Token Accuracy
70.3%
Fraction of correct next-token predictions
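Token accuracy as Vertex AI reports it is presumably the fraction of positions where the argmax next-token prediction matches the label; a minimal sketch (plain lists, with the common convention of masking padding positions as -100):

```python
def token_accuracy(pred_ids, label_ids, ignore_id=-100):
    """Fraction of positions where the predicted token id matches the
    label, skipping masked/padding positions (label == ignore_id)."""
    correct = total = 0
    for p, y in zip(pred_ids, label_ids):
        if y == ignore_id:
            continue  # padding / prompt tokens excluded from the metric
        total += 1
        correct += int(p == y)
    return correct / total if total else 0.0

# 2 of 3 scored positions correct; the last position is masked out
print(token_accuracy([5, 9, 2, 7], [5, 9, 3, -100]))
```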
Single-epoch baseline on Azure AI Foundry — 782 steps, constant LR 1e-4. The model rapidly acquires mythology vocabulary and Midjourney parameter structure. Establishes the foundation for Run 2.
Starting Loss
2.564
Step 1, before any training
Train End Loss
1.126
Step 782, final train step
Eval Loss
1.195
Epoch 1 checkpoint
Loss Reduction
56.5%
From 2.564 to 1.195
Extended to 2 epochs on the same dataset and hyperparameters. Epoch 2 improved eval loss by 14.8% over epoch 1 — the model consolidated style patterns rather than memorizing. Best overall eval loss across all three runs.
Starting Loss
2.564
Shared baseline with Run 1
Epoch 1 Eval
1.195
Step 782 checkpoint
Epoch 2 Eval
1.017
Step 1,564 — best across all runs
Epoch Improvement
14.8%
Epoch 1 → Epoch 2 eval
Patterns that held across all three runs, and the persistent issues that need to be resolved before Run 3.
56.5% loss reduction. The model rapidly acquired domain vocabulary — mythology characters, cinematography terms, Midjourney parameters.
Despite only 8.2% train loss reduction, eval loss improved 14.8%. Epoch 2 consolidated patterns rather than memorizing.
Both runs produced nearly identical metrics through step 782, so training on Azure AI Foundry appears deterministic for a fixed seed and data order.
8.24–8.43 samples/s is reasonable for a 32B parameter model, indicating efficient batch utilization.
LR jumps from 0 to 1e-4 at step 2. Initial gradient norm (2.125) is high. A 10% warmup (~156 steps) would smooth early training.
A flat 1e-4 applies the same learning rate throughout. Cosine decay would stabilize later training and address the plateau at steps 1100–1200.
Average grad norm increased 49% (1.592 → 2.379). 37 spikes >3.0 suggest unnecessarily large updates. Gradient clipping (max_norm=1.0) recommended.
Train loss (1.195) > eval loss (1.017) is atypical. Possible: val set is easier, train includes noisier examples (63.3% unrated), or insufficient eval checkpoints.
Only end-of-epoch evaluation. Cannot determine where optimal checkpoint lies within each epoch. Best practice: every 100–200 steps.
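The schedule fixes above (10% linear warmup, cosine decay, evals every 100–200 steps) can be sketched as a schedule function; the step counts mirror Run 2's 1,564 steps, and whether Azure AI Foundry's managed fine-tuning exposes these as job hyperparameters is an open question. Gradient clipping would be the separate `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` call inside the training step.

```python
import math

TOTAL_STEPS = 1564                      # Run 2 length
WARMUP = int(0.10 * TOTAL_STEPS)        # ~156 steps, per the 10% recommendation
PEAK_LR = 1e-4

def lr_at(step):
    """Linear warmup from ~0 to PEAK_LR, then cosine decay to 0."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL_STEPS - WARMUP)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

# evaluate every 150 steps instead of only at epoch boundaries
eval_steps = [s for s in range(TOTAL_STEPS) if s > 0 and s % 150 == 0]
print(f"{lr_at(0):.2e} {lr_at(WARMUP):.2e} {lr_at(TOTAL_STEPS - 1):.2e}")
```

This removes the step-2 jump to 1e-4, shrinks the step size over the 1100–1200 plateau region, and gives ten intra-epoch checkpoints to locate the optimum.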
Most examples have no quality signal. Including disliked prompts (5.5%) may teach the model to produce suboptimal outputs.
Only 291/15,434 pairs (1.9%) are video-related. If video analysis is a key use case, target 5–10% representation.
No multi-turn examples. The model won't learn iterative refinement or follow-up feedback — critical for a creative director role.
All examples use the same 1,175-char system prompt. Creates tight coupling — model may underperform with any variation.
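One way to loosen that coupling is rotating paraphrased system prompts during dataset construction; the variants below are hypothetical stand-ins, not the actual 1,175-char prompt:

```python
import random

# Hypothetical paraphrases of the single fixed system prompt;
# sampling among them teaches the model to tolerate variation.
SYSTEM_VARIANTS = [
    "You are a creative director specializing in mythology-inspired Midjourney prompts.",
    "Act as a visual-storytelling director; craft Midjourney prompts rooted in mythology.",
    "You generate cinematic, mythology-themed Midjourney prompts and critique imagery.",
]

def with_varied_system(example, rng):
    """Return a copy of a training example with a randomly chosen system prompt."""
    out = dict(example)
    out["system"] = rng.choice(SYSTEM_VARIANTS)
    return out

rng = random.Random(0)
ex = with_varied_system({"prompt": "Athena at dawn", "response": "..."}, rng)
print(ex["system"] in SYSTEM_VARIANTS)  # True
```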
Addressing the training instability, data quality gaps, and evaluation gaps identified across Runs 0–2.
Held-out test set — Curated 100–200 examples covering all task types — not in training or validation
Task-specific eval — Track loss separately for prompt gen, image analysis, video replication, and refinement tasks
Human eval baseline — Score 50 outputs on accuracy, mythology depth, style, and production-readiness
A/B test vs base — Compare fine-tuned Qwen3-32B vs base model vs Gemini on identical prompts
Three blockers remain before Run 3 can start. Data pipeline and format conversion are the critical path.
21,318 images + 8,586 videos in GCS
15,434 pairs (Gemini format)
817/817 transferred to GCS
Run build_multimodal_dataset.py
Write convert_to_azure.py
Filter to liked/rated only
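The two scripted blockers above, the Gemini-to-Azure conversion and the liked/rated filter, could look like the sketch below. The Gemini and Azure record shapes and the `is_liked`/`rating` fields are assumptions about the dataset schema, not the contents of the actual `convert_to_azure.py`:

```python
import json

def convert_record(rec, system_prompt):
    """Map a Gemini-style pair to the OpenAI/Azure chat-messages format.

    Assumed shapes: Gemini records carry `contents` with user/model turns;
    Azure fine-tuning expects a `messages` list. Adjust to the real schema.
    """
    turns = {t["role"]: t["parts"][0]["text"] for t in rec["contents"]}
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": turns["user"]},
        {"role": "assistant", "content": turns["model"]},
    ]}

def keep(rec):
    """Filter to liked or rated-4+ pairs, dropping unrated/disliked ones."""
    return bool(rec.get("is_liked")) or rec.get("rating", 0) >= 4

gemini_rec = {
    "contents": [
        {"role": "user", "parts": [{"text": "Describe Anubis in golden hour light"}]},
        {"role": "model", "parts": [{"text": "anubis, golden hour, cinematic --ar 16:9"}]},
    ],
    "is_liked": True, "rating": 0,
}
if keep(gemini_rec):
    print(json.dumps(convert_record(gemini_rec, "You are a creative director."))[:40])
```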