Quantization & Prompt Engineering

1. bitsandbytes

https://huggingface.co/docs/transformers/ko/quantization/bitsandbytes


bitsandbytes is the easiest way to quantize a model to 8-bit and 4-bit. 4-bit quantization is commonly used together with QLoRA to fine-tune quantized large language models.

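import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; computation runs in bfloat16
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)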
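
To get a rough feel for what quantizing weights to a few bits means, here is a toy absmax quantizer in plain Python. It is only an illustration under my own assumptions: the function names are hypothetical, and this is not the NF4 scheme that bitsandbytes actually applies.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integers using one shared scale (toy absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_absmax(weights, bits=4)
restored = dequantize(q, scale)

# The rounding error per weight is at most half of one quantization step.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, errors)
```

The point of the sketch is that fewer bits means a coarser grid, so each weight is stored with a small rounding error in exchange for less memory.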

2. FlashAttention

https://huggingface.co/docs/transformers/perf_infer_gpu_one


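from transformers import AutoModelForCausalLM

# load in 4bit with FlashAttention-2; model_id is the Hub id of the model to load
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
)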

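The paper reports that flash-attn can speed up attention computation by roughly 2 to 4 times. Keep in mind, though, that the paper's experimental environment differs from the one we run our own experiments in.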
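
Since the numbers depend on the environment, it is worth timing things yourself. Below is a minimal timing sketch using only the standard library; `benchmark` and `speedup` are hypothetical helper names, and the two lambdas are stand-in workloads rather than real attention calls.

```python
import statistics
import time


def benchmark(fn, warmup=2, repeats=5):
    """Return the median wall-clock seconds of fn() over `repeats` timed runs."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workloads; replace them with two model.generate calls to compare.
baseline = lambda: sum(i * i for i in range(200_000))
candidate = lambda: sum(i * i for i in range(50_000))

print(speedup(benchmark(baseline), benchmark(candidate)))
```

When timing GPU code, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels are launched asynchronously.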

3. BetterTransformer

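A transformer implementation that makes use of kernel fusion and nested tensors.

Kernel fusion: merging two or more kernels into a single kernel.

1) Merges independent computations into one to minimize memory movement

2) Exploits input sparsity so that no unnecessary work is done on padding tokens

It can be seen as a feature somewhat better suited to encoder-type models.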
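
To see why skipping padding tokens can matter, the sketch below counts how many token positions in a padded batch are pure padding. The function name and the example lengths are made up for illustration.

```python
def padding_stats(lengths):
    """Compare a batch padded to its longest sequence against the real token count."""
    max_len = max(lengths)
    padded_tokens = max_len * len(lengths)   # what a dense padded batch processes
    real_tokens = sum(lengths)               # what actually carries information
    return {
        "padded_tokens": padded_tokens,
        "real_tokens": real_tokens,
        "wasted_fraction": (padded_tokens - real_tokens) / padded_tokens,
    }


stats = padding_stats([12, 50, 7, 31])
print(stats)
```

The more the sequence lengths in a batch vary, the larger the wasted fraction, and the more there is to gain from not computing on padding.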

4. Prompt Engineering

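A special token is a token with a specific, reserved meaning that an LLM uses when processing text data.

It is a single token that is never split further, and it helps the LLM understand instructions correctly.

It expresses a special meaning or acts as a signal for a particular task.

With apply_chat_template, the tokenizer converts the messages into the special-token format that matches the model (e.g. <|user|>, <|assistant|>, <|system|>). This puts the conversation in a structure the LLM understands more easily, so it can respond while keeping the proper context.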
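
To make the idea concrete, here is a hand-written toy renderer that turns a message list into one string with role markers. It is only a sketch: the `<|end|>` marker and the overall layout are my own assumptions, and every real model defines its own template, which is exactly why apply_chat_template should be used in practice.

```python
def render_chat(messages, add_generation_prompt=False):
    """Toy chat template: wrap each message in a role marker and an end marker."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        parts.append("<|assistant|>\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a helpful chatbot."},
    {"role": "user", "content": "Hello!"},
]
rendered = render_chat(demo_messages, add_generation_prompt=True)
print(rendered)
```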

Model used: a llama model

It is generally reported that quantizing a large model gives better performance than quantizing and using a small model.
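
When choosing between model sizes, a quick weight-memory estimate helps. The sketch below counts weights only (no activations, KV cache, or quantization overhead), and the parameter counts are hypothetical examples rather than specific models.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


small_fp16 = weight_memory_gb(8e9, 16)    # hypothetical 8B model kept in 16-bit
large_4bit = weight_memory_gb(32e9, 4)    # hypothetical 32B model quantized to 4-bit
print(small_fp16, large_4bit)
```

Under these assumptions both fit in about the same weight budget, which is the kind of situation where a quantized larger model becomes a realistic alternative.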

1) Multi Turn Prompt Engineering

messages = [
    {"role": "system", "content": "You are a nice chatbot that helps users. You always have to respond briefly, within three sentences."},
    {"role": "user", "content": "What is the capital of the United States?"},
    {"role": "assistant", "content": "The capital of the United States is Washington, D.C."},
    {"role": "user", "content": "Then, what about Korea?"},
]

input_ids = tokenizer.apply_chat_template(
    messages ,
    add_generation_prompt=True,  # set to True at inference time
    return_tensors='pt'
    ).to(model.device) 
    
print(tokenizer.decode(input_ids[0]))

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
    
outputs = model.generate(
	input_ids,
    max_new_tokens = 256, 
    eos_token_id = terminators,
    do_sample = True,
    temperature = 0.6
    ) 
    
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response , skip_special_tokens = True))
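
To continue the conversation, the model's reply and the next question are appended to the history before calling apply_chat_template again. A small sketch of that bookkeeping follows; `append_turn` is a hypothetical helper, and the check that roles alternate is my own assumption about what a well-formed history looks like.

```python
def append_turn(messages, role, content):
    """Return a new history with one more turn; the input list is left untouched."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role}")
    if messages and messages[-1]["role"] == role:
        raise ValueError(f"two consecutive '{role}' turns")
    return messages + [{"role": role, "content": content}]


history = [
    {"role": "system", "content": "You are a nice chatbot that helps users."},
    {"role": "user", "content": "What is the capital of the United States?"},
]
# In real use, the assistant content would be the decoded model response.
history = append_turn(history, "assistant", "The capital is Washington, D.C.")
history = append_turn(history, "user", "Then, what about Korea?")
print([m["role"] for m in history])
```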

2) Zero-shot Prompt Engineering

messages = [
    {
        "role": "system",
        "content": "You are a Korean robot that summarizes documents. You MUST answer in Korean."
    },
    {
        "role": "user",
        "content": """
        ###document : 기후 변화는 수십 년에서 수백만 년에 걸친 기간 동안의 기상 패턴의 통계적 분포에서 장기적인 변화를 의미합니다.
        이는 평균 기상 조건의 변화 또는 평균 조건 주변의 기상 분포의 변화를 의미할 수 있습니다.
        """
    }
]

input_ids = tokenizer.apply_chat_template(
	messages ,
    add_generation_prompt = True,
    return_tensors = 'pt',
).to(model.device)

print(tokenizer.decode(input_ids[0]))

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
    
outputs = model.generate(
	input_ids,
    max_new_tokens = 256, 
    eos_token_id = terminators,
    do_sample = True,
    temperature = 0.6
    ) 
    
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response , skip_special_tokens = True))

3) Few-shot Prompt Engineering

messages = [ 
    { 
        "role": "system",
        "content": "You are a Korean robot that summarizes documents. You MUST answer in Korean."
    },
    {  
        "role": "user", 
        "content": "###document : 기후 변화는 수십 년에서 수백만 년에 걸친 기간 동안의 기상 패턴의 통계적 분포에서 장기적인 변화를 의미합니다. 이는 평균 기상 조건의 변화 또는 평균 조건 주변의 기상 분포의 변화를 의미할 수 있습니다."
    },
    {  
        "role": "assistant", 
        "content": "기후 변화는 장기간에 걸친 기상 패턴의 변화로, 평균 기상 조건이나 분포가 달라지는 현상을 의미합니다."
    },
    {  
        "role": "user", 
        "content": "###document : 인공지능은 인간의 학습 능력과 논리적 추론, 문제 해결 등을 모방하는 기술입니다. 주로 기계 학습과 심층 학습을 활용하여 데이터를 분석하고 예측을 수행합니다."
    },
    {  
        "role": "assistant", 
        "content": "인공지능은 인간의 학습과 문제 해결 능력을 모방하며, 기계 학습과 심층 학습을 이용해 데이터 분석과 예측을 수행하는 기술입니다."
    },
    {  
        "role": "user", 
        "content": "###document : 블록체인은 데이터를 분산 저장하여 보안성과 투명성을 높이는 기술입니다. 중앙 기관 없이 거래를 검증하며, 금융, 물류 등 다양한 분야에 활용됩니다."
    }
]
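
Writing the example turns by hand gets tedious, so a small builder can assemble the same structure from (document, summary) pairs. This is only a sketch; `build_few_shot_messages` is a hypothetical helper, and the short English strings are placeholders.

```python
def build_few_shot_messages(system_prompt, examples, query, prefix="###document : "):
    """Build a system turn, one user/assistant pair per example, and the final query."""
    messages = [{"role": "system", "content": system_prompt}]
    for document, summary in examples:
        messages.append({"role": "user", "content": prefix + document})
        messages.append({"role": "assistant", "content": summary})
    # The last user turn is the document we actually want summarized.
    messages.append({"role": "user", "content": prefix + query})
    return messages


few_shot = build_few_shot_messages(
    "You are a Korean robot that summarizes documents. You MUST answer in Korean.",
    [("doc A", "summary A"), ("doc B", "summary B")],
    "doc C",
)
print(len(few_shot))
```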

4) Chain-of-Thought (CoT) Prompt Engineering

In the case of the mistral model, there is no separate system message.
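
One way to handle a template without a system role is to merge the system instruction into the first user message. The sketch below does that for a CoT-style instruction; `fold_system_into_user` is a hypothetical helper, and the prompt text is just an example.

```python
def fold_system_into_user(messages):
    """Move system content to the top of the first user message."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]   # copy, do not mutate
    if system_parts:
        if not rest or rest[0]["role"] != "user":
            raise ValueError("expected a user message to carry the system content")
        rest[0]["content"] = "\n\n".join(system_parts + [rest[0]["content"]])
    return rest


cot_messages = [
    {"role": "system", "content": "Think step by step, then give the final answer."},
    {"role": "user", "content": "What is 17 * 3?"},
]
folded = fold_system_into_user(cot_messages)
print(folded[0]["content"])
```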

5) Generated Knowledge Prompt Engineering

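step 1. Generate knowledge

step 2. Provide the knowledge together with the question

Generated knowledge prompting can have an effect similar to RAG. The difference is whether the data was produced by the LLM itself or comes from an external source. If you believe the external data is something the LLM could generate on its own, there is no real need to build a DB and use RAG.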
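
The two steps can be sketched as a small function that calls the model twice. Everything here is an assumption for illustration: `generate` stands for any function that takes a prompt string and returns text, and `fake_generate` is a stand-in so the flow can be run without a model.

```python
def generated_knowledge_answer(question, generate, num_facts=3):
    """Step 1: ask for knowledge. Step 2: answer using that knowledge."""
    knowledge_prompt = (
        f"List {num_facts} short facts that help answer the question.\n"
        f"Question: {question}"
    )
    knowledge = generate(knowledge_prompt)

    answer_prompt = (
        f"Knowledge:\n{knowledge}\n\n"
        f"Using the knowledge above, answer the question.\n"
        f"Question: {question}"
    )
    return knowledge, generate(answer_prompt)


def fake_generate(prompt):
    """Stand-in for a real model call; returns canned text."""
    if prompt.startswith("Knowledge:"):
        return "final answer"
    return "- fact A\n- fact B"


knowledge, answer = generated_knowledge_answer("Why is the sky blue?", fake_generate)
print(knowledge, answer)
```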

6) Self-Ask Prompt Engineering

gemma's chat format consists only of user and assistant roles. So any system instruction can simply be placed at the top of the first user prompt.

messages = [ 
    { 
        "role": "user",
        "content": """
        You are an English Teacher who teaches Korean Students.
        You always have to explain in the format of a conversation between a student and teacher.
        sentence : 나는 아버지가 방에 들어가는 모습을 보고 많이 후회되고 힘들었다. 
        """
    },
    {  
        "role": "assistant", 
        "content": """
        Teacher  : What is the verb in the sentence?  
        Student : The Verb is '후회하고 힘들어했다' which translates to regretted and struggled.
        Teacher : What is the object of the sentence ? 
        Student : the object is '아버지가 방에 들어가는 모습' which translates to the sight of my father entering the room.
        Teacher : Now, Can you try to put it all together in English?
        Student : Yes, the sentence in English would be, "I regretted and struggled a lot after seeing my father entering the room"
        """
    },
    {
        "role": "user",
        "content": "sentence : 어제 밤에 일이 너무 힘들어서 나는 새벽에 깨서 엉엉 울었다."
    },

]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

print(tokenizer.decode(input_ids[0]))

outputs = model.generate(
    input_ids,
    max_new_tokens=128,
    top_p=0.9,
)

# keep only the newly generated tokens, dropping the prompt
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

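It is a good idea to evaluate the model while trying out a variety of prompts.

7) Prompt Chaining

Put simply, this means feeding the answer the model generated back into a new prompt in order to arrive at the final answer.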

import json

few_shot_context = "LG전자가 임직원들에게 무료로 사내식당 조식을 제공키로 했다.업계에 따르면 LG전자는 내달 1일부터 3만5000여 명에 달하는 국내 전 사업장 임직원들.."
few_shot_question = "LG전자의 국내 전 사업장 임직원은 몇 명인가요?"
few_shot_answer = "LG전자의 국내 전 사업장 임직원 수는 약 3만 5000명입니다."

context = "예산군은 2024년도 여름방학 대학생 아르바이트 희망자 40명을 6월 24일부터 26일까지 모집한다고 밝혔다."

messages = [
    {"role" : "user", "content" : f"""You are a robot that generates question and answers using the given context. \n You MUST generate in Korean with JSON. \ncontext : {few_shot_context}"""},
    {"role" : "assistant", "content" : f"{{\"question\" : \"{few_shot_question}\", \"answer\" : \"{few_shot_answer}\"}}"},
    {"role" : "user", "content" : f"context : {context}"}
]


input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    top_p=0.9
)

response = outputs[0][input_ids.shape[-1]:]
tokenizer.decode(response, skip_special_tokens=True)
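
To actually chain, the decoded JSON has to be parsed and placed into the next prompt. Here is a minimal sketch of that step; the helper names and the follow-up instruction are my own assumptions, and `raw` is a made-up stand-in for the decoded model output.

```python
import json


def parse_qa(raw_response):
    """Extract the first {...} span from the model output and read question/answer."""
    start = raw_response.find("{")
    end = raw_response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    data = json.loads(raw_response[start:end + 1])
    return data["question"], data["answer"]


def build_followup_messages(context, question, answer):
    """Second link of the chain: ask the model to check the generated pair."""
    content = (
        f"context : {context}\n"
        f"question : {question}\n"
        f"answer : {answer}\n"
        "Is the answer supported by the context? Reply yes or no."
    )
    return [{"role": "user", "content": content}]


raw = 'Sure. {"question" : "How many students?", "answer" : "40"}'
question, answer = parse_qa(raw)
followup = build_followup_messages("40 students are recruited.", question, answer)
print(followup[0]["content"])
```

The returned messages can then go through apply_chat_template and model.generate exactly as in the first step.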
