The fine-tuning of pre-trained models has become ubiquitous in generative AI, computer vision, and robotics. Although much attention has been paid to improving the efficiency of fine-tuning models, there has been less scholarship around fine-tuning specifically for improved model performance. To remedy this gap, we present PROFIT, one of the first optimizers designed to incrementally fine-tune converged models on new tasks and/or datasets. Unlike traditional optimizers such as SGD or Adam, which make minimal assumptions because training typically starts from a random initialization, PROFIT explicitly takes the properties of a converged model into account to regularize the optimization process. Employing a temporal gradient-orthogonalization process, PROFIT outperforms standard fine-tuning methods across a range of tasks, from image classification to multimodal language model training to large-scale motion prediction. Moreover, PROFIT is encapsulated as a modular optimizer, making it easy to integrate directly into any training pipeline with minimal engineering effort.
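To give a flavor of what regularizing against a converged model can look like, here is a minimal NumPy sketch of a gradient-orthogonalization update. This is an illustration, not the PROFIT algorithm itself: the function name `orthogonalized_step` and the specific rule of projecting out the gradient component along the drift from the converged reference weights are our assumptions for exposition; the paper defines the actual temporal orthogonalization procedure.

```python
import numpy as np

def orthogonalized_step(w, w_ref, grad, lr=1e-2):
    """Illustrative SGD step that removes the gradient component
    pointing along the drift away from a converged reference model.

    w      -- current (flattened) parameter vector
    w_ref  -- parameters of the converged reference model
    grad   -- gradient of the fine-tuning loss at w
    lr     -- learning rate
    """
    # Drift of the fine-tuned weights away from the converged reference.
    drift = w - w_ref
    n = np.linalg.norm(drift)
    if n > 1e-12:
        d = drift / n
        # Keep only the gradient component orthogonal to the drift,
        # so the update does not push the model further from the reference.
        grad = grad - np.dot(grad, d) * d
    return w - lr * grad
```

With `w_ref = [1, 0]`, `w = [2, 0]`, and `grad = [1, 1]`, the drift is along the first axis, so only the second gradient component survives and the applied update is orthogonal to the drift. A production version would instead be written as a `torch.optim.Optimizer` subclass so it drops into existing training loops.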
As machine learning shifts to fine-tuning, our tools must evolve too. PROFIT is a critical step towards a new class of optimizers designed for adapting the most powerful foundational models. Future work will prioritize making PROFIT memory-efficient enough to handle large foundational models.
@inproceedings{chakravarthy2025profit,
author = {Anirudh S Chakravarthy and Shuai Kyle Zheng and Xin Huang and Sachithra Hemachandra and Xiao Zhang and Yuning Chai and Zhao Chen},
title = {{PROFIT: A Specialized Optimizer for Deep Fine Tuning}},
booktitle = {NeurIPS},
year = {2025},
url = {https://arxiv.org/abs/2412.01930}
}
We thank Carl Vondrick, Greg Meyer, Eric Wolff, Siddhartha Srinivasa, Hongge Chen, David Hayden, Yifeng Zeng, Navaneeth Bodla, Ajaya h s Rao, Ankit Raj, Annie Liu, Gweltaz Lever, Raghid Mardini, and Pratik Agarwal for their helpful feedback and insightful discussions.