Adaptive Agency

Building a Funding Foundation Model: Reinforcement Learning from the Model's Own Mistakes

Vicunous Research

Having optimized model inputs via XGBoost-based feature selection, we turn to optimizing the model itself. Using Direct Preference Optimization (DPO) with hard-negative mining (preference pairs constructed from the SFT model's own mistakes), we find that the hard-negative rate, measured before any DPO training begins, is a useful diagnostic: below ~25% DPO adds no value over plain SFT, while above ~30% it delivers a measurable recall improvement on the hardest classification task in our funding suite.

In our previous post, we introduced an XGBoost-based feature selection approach to address high-dimensional datasets suffering from noise. The results were clear: pruning irrelevant features helped the general-purpose model on regression tasks, but had a negligible effect on classification (around -1% for both TabPFN and the fine-tuned GPT-4o mini). The classification models, it seemed, were already handling feature noise reasonably well on their own.

This raised a natural follow-up question: if we have already optimized the input to the model, can we instead optimize the model itself? Rather than curating better data, can we refine how the model learns from its mistakes?

In this post, we shift our focus from data optimization to model optimization. We introduce Direct Preference Optimization (DPO), a preference-based technique from the RLHF family that teaches the model to prefer correct answers over its own prior errors. While feature selection was our primary tool for regression, DPO is applicable only to classification in our setup: the framework would treat each continuous numerical prediction as a distinct class, which sits awkwardly with the logic of regression. Where feature selection is a filter applied before the model, DPO targets the model's decision boundary by reusing its own confusions as the training signal. Together, these two approaches give us two distinct levers for pushing a general-purpose model toward vertical-expert performance on classification tasks.

What Is DPO?

In our earlier benchmarking, we demonstrated the fine-tuning factor. Supervised Fine-Tuning (SFT) works by imitating correct answers. The model sees thousands of examples of correct predictions and learns to reproduce them. But imitation has a ceiling: the model learns what to predict without learning why one answer is better than another.

DPO addresses this gap. Instead of training on correct examples alone, DPO presents the model with preference pairs — a preferred output (the correct answer) alongside a non-preferred output (an incorrect answer) — and trains it to discriminate between the two. The key advantage over full Reinforcement Learning from Human Feedback (RLHF) is that DPO requires no separate reward model; the preference pairs themselves encode the reward signal, making it simpler and cheaper to implement. Whether the preference-pair framing actually produces meaningfully different training behavior from plain SFT in our binary-classification setting is a question we return to in the Results section.
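For reference, the standard DPO objective can be written as below. The notation is the usual one from the DPO literature rather than anything defined in this post: $x$ is the prompt, $y_w$ the preferred output, $y_l$ the non-preferred output, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference checkpoint, $\sigma$ the logistic function, and $\beta$ a hyperparameter controlling how strongly the model is held near the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \left[
          \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right]
      \right)
    \right]
```

The loss falls as the model raises the probability of $y_w$ relative to $y_l$ by more than the reference does, which is why no separate reward model is needed.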

For the funding domain, this framing carries operational weight. In lending decisions, the cost of a false negative (failing to flag a defaulting borrower) and the cost of a false positive (rejecting a creditworthy applicant) are rarely symmetric. Shifting the precision-recall tradeoff — by training the model on the specific boundary cases where it struggles — is the kind of calibration a funding agent might need.

The Migration — GPT-4o mini to GPT-4.1 mini

Before we could apply DPO, we encountered a practical constraint: GPT-4o mini does not support DPO through OpenAI's fine-tuning API. This led us to migrate to GPT-4.1 mini, which required re-running the SFT stage on all five classification datasets to establish a new baseline.

This migration was not free. To measure the cost, we compared the SFT performance of both base models across our classification suite.

Dataset	SFT (GPT-4o mini)	SFT (GPT-4.1 mini)	Change (pp)
Credit Card Fraud	0.948	0.940	−0.8
Loan Approval	0.962	0.962	0.0
Give Me Some Credit	0.754	0.761	+0.7
Prosper Loan	0.778	0.775	−0.3
Lending Club Loan	0.676	0.636	−4.0

Table 1: SFT model comparison (F1 score).

For all DPO results that follow, the SFT baseline refers to the GPT-4.1 mini checkpoint.

The Pipeline — From SFT to DPO

Our DPO experiment follows a five-stage pipeline. Each stage feeds into the next, with the SFT model serving double duty: as both the starting checkpoint for DPO training and the source of the hard-negative mining signal.

Stages 1 and 2 reuse the SFT approach from previous posts and produce the fine-tuned SFT checkpoint. The novel contribution begins at Stage 3.

Hard-Negative Mining

The standard approach to constructing DPO preference pairs is to take the correct label and pair it with a random wrong label. The issue here is that if the model already knows the random answer is wrong, the training signal carries little information — there is not much new for the model to learn.

Our pipeline takes a different approach. Instead of fabricating wrong answers, we query the SFT model itself on candidate rows and collect its actual mistakes:

1. Sample up to 2,000 candidate rows from the training data. We chose this cap for two main reasons:
   - Bounded API costs: each candidate requires a live inference call to the SFT model, so capping the samples keeps our experimental budget manageable.
   - Cross-dataset comparability: our library varies by nearly 190x in scale (383k rows for Lending Club vs. ~2k for Credit Card Fraud), so a uniform cap keeps DPO training intensity consistent across datasets.
2. Run the SFT model on each candidate.
3. Partition predictions by correctness:
   - SFT wrong → hard-negative pair: preferred = ground truth, non-preferred = the SFT model's actual wrong prediction.
   - SFT correct → topped up as a random-negative pair (correct label paired with a random wrong label).
4. Balance pairs across classes and split 90/10 into train/validation.
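The mining steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our production pipeline: `predict` is a hypothetical wrapper around a call to the SFT model, the dictionary keys are illustrative rather than any API's required format, and class balancing is omitted for brevity.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[str, str]  # (prompt text, ground-truth label)


def build_preference_pairs(
    rows: Sequence[Row],
    predict: Callable[[str], str],  # hypothetical wrapper around the SFT model
    labels: Sequence[str] = ("0", "1"),
    cap: int = 2000,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Turn SFT predictions into hard-negative or random-negative pairs."""
    rng = random.Random(seed)
    # Sample at most `cap` candidates to bound inference cost.
    candidates = rng.sample(list(rows), min(cap, len(rows)))
    pairs = []
    for prompt, truth in candidates:
        guess = predict(prompt)
        if guess != truth:
            # SFT was wrong: its actual mistake becomes the rejected answer.
            rejected, kind = guess, "hard"
        else:
            # SFT was right: pair the truth with a random wrong label.
            rejected = rng.choice([lab for lab in labels if lab != truth])
            kind = "random"
        pairs.append({
            "prompt": prompt,
            "preferred": truth,
            "non_preferred": rejected,
            "kind": kind,
        })
    return pairs


def hard_negative_rate(pairs: List[Dict[str, str]]) -> float:
    """Fraction of candidates the SFT model got wrong."""
    return sum(p["kind"] == "hard" for p in pairs) / len(pairs)


def train_val_split(pairs, val_fraction: float = 0.1, seed: int = 0):
    """Shuffle and split into (train, validation), 90/10 by default."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Keeping the `kind` field on each pair makes the hard-negative rate a by-product of mining rather than a separate computation.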

The intuition is simple: it is like giving a student a practice exam, marking what they got wrong, and then drilling them specifically on those questions — not on random questions they already know. The training signal is concentrated precisely where the model is confused.

Dataset	Candidates Queried	Hard Negatives (SFT Wrong)	Hard-Negative Rate	Total DPO Pairs
Credit Card Fraud	662	49	7.4%	662
Loan Approval	2,000	88	4.4%	1,976
Give Me Some Credit	2,000	391	19.6%	1,981
Prosper Loan	2,000	487	24.4%	1,876
Lending Club Loan	2,000	632	31.6%	1,868

Table 2: Hard-negative mining statistics per dataset.

The hard-negative rate — the fraction of candidates the SFT model gets wrong — varies across datasets: from 4.4% on Loan Approval (where SFT is already strong) to 31.6% on Lending Club Loan (where SFT genuinely struggles). As we will show, this rate tends to predict how much room DPO has to improve.

Results

We evaluated each DPO model on its dataset's held-out test split, capped at 1,000 samples per dataset for consistency with existing leaderboard entries.

Dataset	Base (GPT-4o mini)	SFT (GPT-4.1 mini)	Pure DPO (GPT-4.1 mini)	SFT → DPO (GPT-4.1 mini)	Δ (SFT→DPO − SFT)
Credit Card Fraud	0.8000	0.9403	0.9329	0.9403	0.0000
Loan Approval	0.8365	0.9617	0.9617	0.9617	0.0000
Give Me Some Credit	0.6980	0.7609	0.7618	0.7581	−0.0028
Prosper Loan	0.6649	0.7747	0.7694	0.7567	−0.0180
Lending Club Loan	0.6702	0.6360	0.6370	0.6647	+0.0287

Table 3: F1 scores across the training progression.

The results tell a nuanced story. DPO matches SFT exactly on two datasets (Credit Card Fraud and Loan Approval), regresses slightly on two others (Give Me Some Credit and Prosper Loan), and delivers a meaningful improvement on the one dataset where SFT struggles most: Lending Club Loan.

Table 3 also reveals a second, more striking pattern. On four of five datasets, Pure DPO — preference optimization applied directly to the base GPT-4.1 mini model, with no SFT in front of it — closely tracks the full SFT→DPO pipeline. Loan Approval is identical across the three columns; on Credit Card Fraud, Give Me Some Credit, and Prosper Loan, Pure DPO and SFT→DPO sit within ~1.3 pp F1 of each other. Lending Club Loan is the lone exception: Pure DPO (0.6370) lands at SFT level, 2.77 pp below SFT→DPO (0.6647).

The reason Pure DPO and plain SFT track each other so closely on our task is structural. For binary classification, there are only two possible labels. Every DPO preference pair looks like prefer "0" over "1" (or the reverse) — a wordy way of saying the right answer is 0, the same information a standard SFT example carries. Layer on a mild "don't drift too far from the model you started with" pull toward the reference checkpoint, and DPO ends up teaching the model much the same lesson as SFT.
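A toy two-label model makes the overlap concrete. The sketch assumes the model's output is restricted to the labels "0" and "1", so its state reduces to a single logit margin z = log p("0") − log p("1"). The function names, this parametrisation, and beta = 0.1 are illustrative assumptions, not details of our training runs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sft_loss_grad(z: float) -> float:
    """d/dz of the SFT loss -log p("0") = -log sigmoid(z), true label "0"."""
    return -(1.0 - sigmoid(z))


def dpo_loss_grad(z: float, z_ref: float, beta: float = 0.1) -> float:
    """d/dz of the DPO loss for the pair: prefer "0" over "1".

    With two labels, the DPO log-ratio difference collapses to
    beta * (z - z_ref), where z_ref is the reference model's margin.
    """
    return -beta * (1.0 - sigmoid(beta * (z - z_ref)))


if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        print(z, sft_loss_grad(z), dpo_loss_grad(z, z_ref=0.0))
```

Both gradients are negative for every margin, so gradient descent raises z, the margin in favour of the correct label, in both cases. The differences are that DPO's step is scaled by beta and is measured relative to the reference margin z_ref, which is the "don't drift too far" pull described above.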

That same framing explains the Lending Club gap. SFT→DPO starts from a checkpoint that already predicts both classes; Pure DPO starts from a base model that predicts class "1" for almost every row. The hard negatives — mined from the SFT model's actual mistakes — were built to fix SFT's confusion boundary, not the base model's strong "always say 1" prior. Applied to the base model, the same signal misfires, and the training budget is not enough to overcome that bias. The four other datasets escape this failure mode because their training pairs are dominated by random negatives, which carry the same information regardless of which reference model the run starts from.

The Standout: Lending Club Loan

Lending Club Loan is the largest and most complex classification dataset in our suite (~383k rows, predicting credit risk). It is also the dataset where GPT-4.1 mini SFT regressed most sharply from the GPT-4o mini SFT baseline (−4.03 pp F1). DPO recovered most of that ground: +2.87 pp F1, which still leaves it about 1.1 pp short of the GPT-4o mini SFT score of 0.676.

Model	Accuracy	Precision	Recall	F1
Base (GPT-4o mini)	0.504	0.504	1.000	0.670
SFT (GPT-4.1 mini)	0.652	0.637	0.635	0.636
DPO (GPT-4.1 mini)	0.656	0.623	0.712	0.665

Table 4: Lending Club Loan — full metric breakdown.

DPO traded −1.39 pp precision for +7.72 pp recall, lifting F1 by 2.87 pp. The recall figure of 0.712 is not just an improvement over SFT — it is the best recall score on the full cross-model leaderboard for this dataset, surpassing TabPFN's 0.711.
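The F1 figures follow directly from the precision and recall columns of Table 4. A quick check, using the rounded three-decimal values from the table (so the deltas differ slightly from the unrounded ones quoted above):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values from Table 4 (Lending Club Loan).
sft = {"precision": 0.637, "recall": 0.635}
dpo = {"precision": 0.623, "recall": 0.712}

f1_sft = f1(**sft)
f1_dpo = f1(**dpo)

print(f"SFT F1: {f1_sft:.3f}")
print(f"DPO F1: {f1_dpo:.3f}")
print(f"F1 delta (pp): {100 * (f1_dpo - f1_sft):.1f}")
```

Because F1 is a harmonic mean, a large recall gain outweighs a small precision loss when the two start out nearly equal, as they do here.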

Why did DPO succeed here? We think the answer lies in Table 2: Lending Club Loan had the highest hard-negative rate at 31.6%. Nearly one in three candidate rows was a case where the SFT model predicted incorrectly. DPO had abundant, high-quality training signals — real examples of the model's confusion at the decision boundary.

For the funding domain, this tradeoff is operationally meaningful. In credit risk assessment, higher recall means catching more potential defaults. A −1.39 pp drop in precision (slightly more false positives) is a modest cost for a +7.72 pp gain in recall (significantly fewer missed defaults).

When DPO Doesn't Help

On Credit Card Fraud and Loan Approval, DPO produced identical results to SFT. The explanation looks mechanical: both datasets had hard-negative rates below 10% (7.4% and 4.4% respectively). The SFT model was already performing well, leaving few real mistakes for DPO to learn from. With the preference dataset dominated by random-negative pairs rather than hard negatives, DPO had insufficient signal to shift the model's behavior. This is not a failure of DPO — it is DPO working exactly as designed. The technique optimizes at the margin of the model's uncertainty. When there is no meaningful uncertainty, there is nothing to optimize.

Give Me Some Credit (−0.28 pp F1) and Prosper Loan (−1.80 pp F1) showed slight performance decreases. A closer look at the per-metric shifts reveals a consistent pattern:

- Give Me Some Credit: DPO shifted toward precision (+1.96 pp) at the expense of recall (−2.46 pp). The model became slightly more conservative in its positive predictions.
- Prosper Loan: DPO regressed on recall (−3.39 pp) on a class-imbalanced test set (~68/32 negative/positive). The model again tilted toward the majority class.

At n = 1,000 test samples, both deltas are within the statistical noise band. But the direction is informative: on datasets where hard negatives are relatively scarce (19.6% and 24.4%), the DPO training set is mostly composed of random-negative pairs. This diluted signal may teach a blunt "be more conservative" heuristic rather than a nuanced correction. Contrast this with Lending Club Loan, where the high density of hard negatives allowed DPO to learn a targeted correction that pushed recall upward.
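As a rough yardstick for that noise band, the sketch below uses a normal-approximation binomial interval. This is an assumption of convenience: the formula applies to accuracy-like proportions, and F1 would strictly need something like a bootstrap over the test rows, so the result should be read as an order of magnitude only.

```python
import math


def binomial_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a ~95% normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


# A score around 0.775 (Prosper Loan's SFT F1) on 1,000 test samples.
half_width = binomial_half_width(0.775, 1000)
print(f"approx. 95% half-width: {100 * half_width:.1f} pp")
```

Under this approximation the band is roughly ±2.6 pp, wider than either the −0.28 pp or the −1.80 pp delta.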

More Training, Same Result

If binary DPO is, mechanically, a flavour of SFT, we should be able to reproduce most of the SFT→DPO gain on Lending Club Loan with a second epoch of plain SFT. To test this, we trained a 2-epoch SFT model on the same data and compared it with both the 1-epoch SFT baseline and the SFT→DPO model.

Model	Accuracy	Precision	Recall	F1
SFT (1 epoch)	0.652	0.637	0.635	0.636
SFT (2 epochs)	0.655	0.624	0.704	0.661
SFT → DPO	0.656	0.623	0.712	0.665

Table 5: Lending Club Loan — DPO vs. extended SFT.

Recall climbed from 0.635 to 0.704 with no preference pairs, no hard-negative mining, no DPO machinery at all — just one more pass over the SFT training data. The full DPO pipeline closes only the last 0.4 pp F1 from there. The Lending Club gain is, in large part, the gain from continuing to train.
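To put a number on "in large part", the share of the F1 gain that the second SFT epoch captures can be computed from Table 5. The figures are the rounded table values, so the result is approximate.

```python
# Rounded F1 values from Table 5 (Lending Club Loan).
f1_sft_1_epoch = 0.636
f1_sft_2_epochs = 0.661
f1_sft_dpo = 0.665

gain_from_extra_epoch = f1_sft_2_epochs - f1_sft_1_epoch
gain_from_full_pipeline = f1_sft_dpo - f1_sft_1_epoch
share = gain_from_extra_epoch / gain_from_full_pipeline

print(f"share of DPO gain captured by a 2nd SFT epoch: {share:.0%}")
```

On these rounded values, the extra epoch accounts for roughly 86% of the SFT→DPO improvement in F1.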

Hard-Negative Density — The Predictive Variable

A practical takeaway from this experiment is that the hard-negative rate — measurable before any DPO training begins — tends to predict the size of the SFT→DPO gap over plain SFT.

Hard-Negative Rate	Dataset	DPO Δ (F1)	Outcome
4.4%	Loan Approval	0.0000	No change
7.4%	Credit Card Fraud	0.0000	No change
19.6%	Give Me Some Credit	−0.0028	Slight regression
24.4%	Prosper Loan	−0.0180	Slight regression
31.6%	Lending Club Loan	+0.0287	Clear improvement

Table 6: Hard-negative rate vs. SFT→DPO F1 delta over plain SFT.

The pattern is fairly consistent, and given the earlier argument that binary DPO is structurally close to SFT, the mechanism comes into view. Below ~10%, SFT has already learned the task; the preference pairs are dominated by random negatives that contribute little information beyond a labeled example, and DPO matches SFT exactly. In the middle range (19–25%), the training set is a mixed bag — some genuine hard negatives diluted by random pairs — and the precision-recall tilt introduced by DPO can fall the wrong way. Above 30%, two conditions align: the SFT model has enough real mistakes to constitute a meaningful error surface, and the hard negatives are dense enough that "another pass of training, weighted toward the SFT model's mistakes" delivers a measurable lift.

This turns the mining step (Stage 3) into a diagnostic that costs only the inference calls already spent on mining, since the statistics land before any DPO training cost is incurred, and lets us state a sharper rule of thumb:

- If hard-negative rate < ~25%: skip the SFT→DPO pipeline. The extra stages (preference-pair construction, hard-negative mining via live inference calls on the SFT model, and a separate DPO training run) introduce real engineering complexity and added training cost for a result that, in our setup, either matches plain SFT or trends slightly worse.
- If hard-negative rate > ~30%: the SFT→DPO pipeline does deliver a measurable lift but, as Table 5 showed, a second SFT epoch on the same data captures most of that gain at a fraction of the complexity. Reach for the full DPO pipeline only after that simpler fallback has been tried and you specifically need the recall tilt DPO produces at the decision boundary.
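The rule of thumb can be written as a small helper. The function name and return strings are illustrative, and the 25–30% band is marked inconclusive only because none of our five datasets falls inside it; that label is an assumption of this sketch, not a measured result.

```python
def recommend(hard_negative_rate: float) -> str:
    """Map a hard-negative rate (0.0 to 1.0) to the rule of thumb above."""
    if hard_negative_rate < 0.25:
        return "skip DPO; plain SFT is enough"
    if hard_negative_rate > 0.30:
        return "try a second SFT epoch first; use DPO if the recall tilt is needed"
    # No dataset in our suite lands in this band, so we make no claim here.
    return "inconclusive; not covered by our experiments"


# Hard-negative rates from Table 2.
rates = {
    "Loan Approval": 0.044,
    "Credit Card Fraud": 0.074,
    "Give Me Some Credit": 0.196,
    "Prosper Loan": 0.244,
    "Lending Club Loan": 0.316,
}

for dataset, rate in rates.items():
    print(f"{dataset}: {recommend(rate)}")
```

With thresholds this close to the observed rates (24.4% and 31.6%), the cut-offs should be treated as provisional until more datasets are added.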

The broader read across these five datasets is that, on binary classification, the SFT→DPO pipeline rarely seems to justify its added complexity over plain SFT. Hard-negative density tells us, before any training budget is spent, which side of that line a dataset sits on.

Conclusion

After five posts in this series, we have assembled a progressively refined understanding of what it takes to push a general-purpose model toward vertical-expert performance in the funding domain. DPO via hard-negative mining adds a new tool to that toolkit.

Key takeaways:

- For binary classification, DPO is structurally close to SFT. With two labels, every preference pair ("prefer 0 over 1") carries roughly the same information as a standard labeled example ("the answer is 0"), so DPO ends up teaching the model much the same lesson SFT teaches, with a mild extra pull to stay close to the starting checkpoint. The Lending Club Loan gain decomposes into two factors, an SFT starting point that already predicts both classes and a second pass of training, both of which a 2-epoch SFT run reproduces directly, without DPO or hard-negative mining.
- Hard-negative density tends to predict the SFT→DPO gap. It is measurable before DPO training begins, at the cost of the mining inference calls alone, which makes it a useful diagnostic. Below ~25%, the DPO pipeline does not appear to pay for its added complexity: results either match plain SFT or trend slightly worse. Above ~30%, the pipeline does deliver a measurable lift, but an extra SFT epoch is often the simpler way to capture most of that same gain.
- On the hardest dataset (Lending Club Loan, 31.6% hard-negative rate), DPO delivered +2.87 pp F1 and the best recall on the full cross-model leaderboard (0.712). A 2-epoch SFT run comes within 0.8 pp of that recall (0.704), reinforcing that the win is largely "more training on the right data" rather than something only DPO can produce.