Prefix-Tuning: Optimizing Continuous Prompts for Generation

나희와더기 2025. 5. 23. 19:11

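This post covers "Prefix-Tuning: Optimizing Continuous Prompts for Generation," published in 2021 by Li and Liang at Stanford. To use a large pretrained language model (e.g., GPT-2, BART) for a natural language generation task, the standard approach is fine-tuning. As models grow from hundreds of millions to tens of billions of parameters, however, training and storing a full copy of the model for every task becomes very inefficient in both cost and storage. The paper proposes a new approach, Prefix-Tuning, to address this problem.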

Paper: https://arxiv.org/abs/2101.00190

Code: https://github.com/XiangLi1999/PrefixTuning

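🎯 Why Prefix-Tuning?

Limits of conventional fine-tuning
- GPT-2 (774M) or GPT-3 (175B) parameters → a full copy of the model is needed for every task
- Storage, memory, and training time grow rapidly with the number of tasks
Adapters as an alternative
- Fewer trainable parameters, but the model architecture itself has to be modified
- Higher inference latency, and harder to parallelize
Core idea of Prefix-Tuning
- All pretrained model parameters are frozen
- A sequence of continuous vectors (the prefix) is prepended to the input and trained
- These vectors act like virtual tokens at every Transformer layer
- Only the prefix vectors are trained (about 0.1% of GPT-2's parameters)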
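
To make the storage argument concrete, here is a rough back-of-the-envelope calculation. The 774M parameter count and the ~0.1% ratio come from the numbers above; the 4 bytes per parameter (fp32) and the 100 tasks are assumptions picked only for illustration.

```python
# Back-of-the-envelope storage cost. fp32 (4 bytes per parameter) is an assumption.
BYTES_PER_PARAM = 4
GPT2_PARAMS = 774_000_000            # GPT-2 size quoted above
PREFIX_PARAMS = GPT2_PARAMS // 1000  # roughly 0.1% of the model, as quoted above


def storage_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Storage in gigabytes (10**9 bytes) for a given parameter count."""
    return num_params * bytes_per_param / 1e9


def total_storage_gb(num_tasks, per_task_params, shared_params=0):
    """One shared copy (if any) plus one task-specific parameter set per task."""
    return storage_gb(shared_params) + num_tasks * storage_gb(per_task_params)


num_tasks = 100  # hypothetical number of downstream tasks

# Fine-tuning: every task keeps its own full copy of the model.
fine_tuning_gb = total_storage_gb(num_tasks, GPT2_PARAMS)
# Prefix-tuning: one frozen shared model plus a small prefix per task.
prefix_tuning_gb = total_storage_gb(num_tasks, PREFIX_PARAMS, shared_params=GPT2_PARAMS)

print(f"fine-tuning:   {fine_tuning_gb:.1f} GB")
print(f"prefix-tuning: {prefix_tuning_gb:.1f} GB")
```

Under these assumptions the per-task cost drops from about 3.1 GB to about 3.1 MB, so the total is dominated by the single shared copy of the frozen model.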

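🧠 Architecture and How It Works

How it is applied
- The input sequence is arranged as [PREFIX; x; y] (for an autoregressive model)
- At prefix positions, the activation at each time step is set directly from a trainable vector
- The remaining positions are computed by the frozen Transformer, which can attend to the prefix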
  - $ h_{i} =
    \begin{cases}
    P_\theta[i], & \text{if } i \in \text{prefix indices} \\
    \text{LM}_\phi(z_i, h_{<i}), & \text{otherwise}
    \end{cases} $
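  - $P_{\theta}$: the set of trainable prefix vectors
  - $\phi$: pretrained LM parameters (frozen)
Key optimization trick
- Optimizing $P_{\theta}$ directly is unstable → reparameterize it through an MLP
- The MLP generates the prefix vectors from smaller-dimensional vectors, which leads to more stable training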
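
The case distinction for $h_i$ can be sketched in a few lines of plain Python. This is only a toy: the "LM" below is a single residual attention step that I made up as a stand-in for $\text{LM}_\phi$, the names `attend` and `activations` are hypothetical, and the real method supplies prefix activations at every Transformer layer rather than in one place.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    if not keys:  # nothing to the left: contribute a zero context vector
        return [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[j] for w, value in zip(weights, values)) / total
        for j in range(len(query))
    ]


def activations(prefix, token_embeddings):
    """h_i = P_theta[i] on prefix positions, otherwise a toy LM step over h_{<i}."""
    h = [list(row) for row in prefix]  # prefix activations are read straight from P_theta
    for z in token_embeddings:
        context = attend(z, h, h)  # attends to everything on the left, prefix included
        h.append([z_j + c_j for z_j, c_j in zip(z, context)])
    return h


tokens = [[0.5, 0.5], [1.0, -1.0]]   # toy embeddings z_i for [x; y]
prefix_a = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical trained prefix for task A
prefix_b = [[0.0, 0.0], [0.0, 0.0]]  # hypothetical trained prefix for task B

h_a = activations(prefix_a, tokens)
h_b = activations(prefix_b, tokens)
print(h_a[2], h_b[2])  # same token, different activations under different prefixes
```

The toy LM has no trainable weights at all, yet swapping the prefix changes the activations of the real tokens. That is the mechanism by which the prefix steers a frozen model.
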
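
A minimal sketch of the reparameterization, assuming a one-hidden-layer tanh MLP and tiny made-up dimensions (the actual sizes and MLP architecture in the paper's code may differ). Only the forward computation $P_\theta[i] = \text{MLP}(P'_\theta[i])$ is shown; gradient updates are omitted.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random matrix stored as a list of rows (toy initialization)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mlp(vector, w1, w2):
    """One hidden layer with tanh: small_dim -> hidden_dim -> model_dim."""
    hidden = [math.tanh(a) for a in matvec(w1, vector)]
    return matvec(w2, hidden)


rng = random.Random(0)
PREFIX_LEN, SMALL_DIM, HIDDEN_DIM, MODEL_DIM = 5, 4, 16, 8  # hypothetical sizes

p_small = make_matrix(PREFIX_LEN, SMALL_DIM, rng)  # P'_theta: smaller trainable matrix
w1 = make_matrix(HIDDEN_DIM, SMALL_DIM, rng)       # trainable MLP weights
w2 = make_matrix(MODEL_DIM, HIDDEN_DIM, rng)

# During training, the prefix is always produced through the MLP,
# so gradients would update p_small, w1, and w2 rather than the prefix itself.
prefix = [mlp(row, w1, w2) for row in p_small]

print(len(prefix), len(prefix[0]))  # PREFIX_LEN rows, MODEL_DIM columns
```

As far as I understand the paper, once training is finished the MLP and the smaller matrix can be thrown away and only the resulting prefix (here `prefix`) needs to be stored, so the reparameterization adds no cost at inference time.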

📊 Experiments and Results

Table-to-Text: GPT-2 (E2E, WebNLG, DART)
- With only 0.1% of the parameters trained, it can even outperform full fine-tuning
- It also generalizes better, especially on the unseen topics of WebNLG

Summarization: BART (XSUM)
- On a more complex task such as text summarization, performance is slightly lower than fine-tuning but still competitive.

Few-shot and generalization performance
- In low-data settings (e.g., 50 to 500 training examples), it outperforms fine-tuning
- It also performs well on unseen-topic generalization (e.g., news → sports)

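Intrinsic analysis and ablations
- Prefix position: placing the vectors in front (PREFIX) works better than placing them between x and y (INFIX)
- Embedding-only vs. full: training only the virtual-token embeddings causes a sharp drop in performance
- Initialization: initializing from real words (e.g., "summarization") works better than random initialization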

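💡 Summary and Significance

Item	Fine-Tuning	Adapter	Prefix-Tuning
Modifies the whole model	✅	❌	❌
Trainable parameters	High	Medium	Very low (~0.1%)
Batching across tasks	❌	❌	Possible
Generalization to unseen topics	Fair	Good	Excellent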

📝 Closing Thoughts

Prefix-Tuning is far more expressive than plain prompting, yet much lighter and more scalable than fine-tuning.
It is especially well suited to practical settings such as per-user personalization, batched cloud inference, and few-shot learning.
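
The batching point can be illustrated with a small sketch. Because the model weights are shared and only the prefix differs, a single batch can mix requests from different tasks or users, with each request paired with its own prefix. The task names, the `Example` class, and the 2-d prefix vectors below are all hypothetical, and the shared frozen model itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Example:
    task: str     # which task (or user) this request belongs to
    tokens: list  # input tokens for the shared frozen model


def build_batch(examples, prefix_store):
    """Pair every example with its own task's prefix; the frozen LM stays shared."""
    return [(prefix_store[ex.task], ex.tokens) for ex in examples]


# Hypothetical store of trained prefixes, one small matrix per task or user.
prefix_store = {
    "table-to-text": [[0.1, 0.2], [0.3, 0.4]],
    "summarization": [[0.9, 0.8], [0.7, 0.6]],
}

examples = [
    Example("summarization", ["A", "long", "news", "article"]),
    Example("table-to-text", ["name", "=", "Aromi"]),
]

batch = build_batch(examples, prefix_store)
for prefix, tokens in batch:
    print(len(prefix), tokens)  # prefix length, then the request's tokens
```

With full fine-tuning, the same mixed batch would need a different copy of the model weights for each request, which is why the table marks batching across tasks as not possible there.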
