MarkTechPost has put out a story worth paying attention to: a complete, hands-on tutorial on post-training large language models with the TRL (Transformer Reinforcement Learning) library ecosystem, starting from a lightweight base model and working through Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). For AI, this usually isn't just about a new model or a new demo; it's about the direction of product strategy. If you follow AI updates, stories like this are often a sign that the line between "experiment" and "everyday working tool" keeps getting thinner.
Looking at it more closely, here is how the tutorial itself runs. In this tutorial, we walk through a complete, hands-on journey of post-training large language models using the powerful TRL (Transformer Reinforcement Learning) library ecosystem. We start from a lightweight base model and progressively apply four key techniques: Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Also, we leverage efficient methods like LoRA to make training feasible even on limited hardware, such as Google Colab's T4 GPU. As we move step by step, we build intuition for how modern alignment pipelines work, from teaching models how to respond to shaping their behavior using preferences and verifiable rewards.

```python
import subprocess, sys
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "-U",
    "torchao>=0.16", "trl>=0.20", "transformers>=4.45",
    "datasets", "peft>=0.13", "accelerate", "bitsandbytes",
])

import sys as _sys
for _m in [m for m in list(_sys.modules) if m.startswith(("torchao", "peft"))]:
    _sys.modules.pop(_m, None)
try:
    import torchao
except Exception:
    import types
    _fake = types.ModuleType("torchao")
    _fake.__version__ = "0.16.1"
    _sys.modules["torchao"] = _fake

import os, re, gc, torch, warnings
warnings.filterwarnings("ignore")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig

print(f"torch={torch.__version__} cuda={torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)} "
          f"({torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB)")

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BF16_OK = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

LORA_CFG = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

def cleanup():
    """Release VRAM between training stages (Colab T4 is tight)."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

def chat_generate(model, tokenizer, prompt, max_new_tokens=120):
    """Helper: format as chat, generate, decode just the assistant turn."""
    msgs = [{"role": "user", "content": prompt}]
    ids = tokenizer.apply_chat_template(
        msgs, return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            ids, max_new_tokens=max_new_tokens,
            do_sample=True, temperature=0.7, top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
```

We install and configure the full training stack, ensuring compatibility across libraries like TRL (Transformer Reinforcement Learning), Transformers, and PEFT. We set up environment variables and GPU checks, and define reusable components such as the LoRA configuration and helper functions. We also prepare utility functions for memory cleanup and chat-style generation to support all later stages.
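One optional addition, not part of the original code but a natural use of the helpers defined above: generate once with the untouched base model so you have a baseline to compare the later SFT, DPO, and GRPO outputs against.

```python
# Optional baseline check (not in the original tutorial): run the helper once
# on the untouched base model before any fine-tuning.
base_tok = AutoTokenizer.from_pretrained(MODEL_NAME)
dtype = torch.bfloat16 if BF16_OK else (torch.float16 if DEVICE == "cuda" else torch.float32)
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=dtype).to(DEVICE)

print("[Base model]", chat_generate(
    base_model, base_tok, "Explain the bias-variance tradeoff in two sentences."))

del base_model, base_tok
cleanup()  # free VRAM before the SFT stage starts
```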
```python
print("\n" + "="*72 + "\nPART 1 — Supervised Fine-Tuning (SFT)\n" + "="*72)
from trl import SFTTrainer, SFTConfig

sft_ds = load_dataset("trl-lib/Capybara", split="train[:300]")
print(f"SFT dataset rows: {len(sft_ds)}")
print(f"Example messages: {sft_ds[0]['messages'][:1]}")

sft_args = SFTConfig(
    output_dir="./sft_out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    max_length=768,
    gradient_checkpointing=True,
    report_to="none",
)

sft_trainer = SFTTrainer(
    model=MODEL_NAME,
    args=sft_args,
    train_dataset=sft_ds,
    peft_config=LORA_CFG,
)
sft_trainer.train()

print("\n[SFT inference]")
print("Q: Explain the bias-variance tradeoff in two sentences.")
print("A:", chat_generate(sft_trainer.model, sft_trainer.processing_class,
                          "Explain the bias-variance tradeoff in two sentences."))

sft_trainer.save_model("./sft_out/final")
del sft_trainer; cleanup()
```

We begin with supervised fine-tuning, loading a conversational dataset and configuring the SFT trainer. We train the model to imitate high-quality responses using LoRA for efficient adaptation on limited hardware. We then validate the model's behavior through inference to confirm it follows instruction-style outputs.

```python
print("\n" + "="*72 + "\nPART 2 — Reward Modeling\n" + "="*72)
from trl import RewardTrainer, RewardConfig

rm_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")
print(f"RM dataset rows: {len(rm_ds)} keys: {list(rm_ds[0].keys())}")

rm_args = RewardConfig(
    output_dir="./rm_out",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    max_length=512,
    gradient_checkpointing=True,
    report_to="none",
)

rm_lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)

rm_trainer = RewardTrainer(
    model=MODEL_NAME,
    args=rm_args,
    train_dataset=rm_ds,
    peft_config=rm_lora,
)
rm_trainer.train()
del rm_trainer; cleanup()
```

We move on to reward modeling, where we train a model to score responses based on human preference data. We configure a sequence classification setup and train on chosen vs. rejected pairs. This stage helps us learn a reward signal that can guide alignment in later methods.

```python
print("\n" + "="*72 + "\nPART 3 — Direct Preference Optimization (DPO)\n" + "="*72)
from trl import DPOTrainer, DPOConfig

dpo_ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:300]")

dpo_args = DPOConfig(
    output_dir="./dpo_out",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    logging_steps=10,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    max_length=512,
    max_prompt_length=256,
    beta=0.1,
    gradient_checkpointing=True,
    report_to="none",
)

dpo_trainer = DPOTrainer(
    model=MODEL_NAME,
    args=dpo_args,
    train_dataset=dpo_ds,
    peft_config=LORA_CFG,
)
dpo_trainer.train()
del dpo_trainer; cleanup()
```

We implement Direct Preference Optimization to optimize the model directly on preference data without needing a separate reward model. We configure a low learning rate and control divergence from the reference model using the beta parameter. We train the model to efficiently align its outputs with the preferred responses.
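None of the following appears in the tutorial itself, but as a rough mental model of what the last two stages optimize, here is a toy sketch: reward modeling uses a pairwise (Bradley-Terry style) loss over chosen vs. rejected scores, while DPO applies the same preference idea directly to the policy's log-probabilities relative to a frozen reference model, scaled by the beta value set in DPOConfig. The numbers below are made up purely for illustration.

```python
# Toy illustration only (not TRL's internal code): the pairwise objectives
# behind reward modeling and DPO, evaluated on made-up numbers.
import torch
import torch.nn.functional as F

# Reward modeling: push the chosen response's score above the rejected one's.
r_chosen, r_rejected = torch.tensor(1.3), torch.tensor(0.4)
rm_loss = -F.logsigmoid(r_chosen - r_rejected)

# DPO: the same preference idea, expressed via the policy's log-probabilities
# relative to a frozen reference model and scaled by beta (0.1 in the config above).
beta = 0.1
logp_chosen, logp_rejected = torch.tensor(-12.0), torch.tensor(-15.0)          # policy
ref_logp_chosen, ref_logp_rejected = torch.tensor(-13.0), torch.tensor(-14.0)  # reference
dpo_loss = -F.logsigmoid(
    beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
)

print(f"toy RM loss: {rm_loss.item():.3f}  toy DPO loss: {dpo_loss.item():.3f}")
```

Because beta is small, the margin inside the sigmoid stays modest, so each preference pair nudges the policy rather than yanking it; that is part of why the tutorial pairs beta=0.1 with a very low learning rate.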
```python
print("\n" + "="*72 + "\nPART 4 — GRPO with verifiable math rewards\n" + "="*72)
from trl import GRPOTrainer, GRPOConfig
import random

random.seed(0)

def make_math_problem():
    a, b = random.randint(1, 50), random.randint(1, 50)
    op = random.choice(["+", "-", "*"])
    expr = f"{a} {op} {b}"
    return {
        "prompt": f"Solve this and end your reply with only the final number. {expr} =",
        "answer": str(eval(expr)),
    }

grpo_ds = Dataset.from_list([make_math_problem() for _ in range(200)])
print(f"GRPO dataset rows: {len(grpo_ds)}")
print(f"Example: {grpo_ds[0]}")

def correctness_reward(completions, **kwargs):
    """+1 if the last number in the completion matches the gold answer."""
    answers = kwargs["answer"]
    rewards = []
    for c, gold in zip(completions, answers):
        nums = re.findall(r"-?\d+", c)
        rewards.append(1.0 if nums and nums[-1] == gold else 0.0)
    return rewards

def brevity_reward(completions, **kwargs):
    """Small bonus for short answers — discourages rambling."""
    return [max(0.0, 1.0 - len(c) / 200) * 0.2 for c in completions]

grpo_args = GRPOConfig(
    output_dir="./grpo_out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_generations=4,
    max_prompt_length=128,
    max_completion_length=96,
    logging_steps=2,
    save_strategy="no",
    bf16=BF16_OK,
    fp16=not BF16_OK,
    gradient_checkpointing=True,
    max_steps=15,
    report_to="none",
)

grpo_trainer = GRPOTrainer(
    model=MODEL_NAME,
    args=grpo_args,
    train_dataset=grpo_ds,
    reward_funcs=[correctness_reward, brevity_reward],
    peft_config=LORA_CFG,
)
grpo_trainer.train()

print("\n[GRPO inference]")
for q in ["What is 17 + 28?", "What is 9 * 7?", "What is 100 - 47?"]:
    a = chat_generate(grpo_trainer.model, grpo_trainer.processing_class, q, 60)
    print(f"Q: {q}\nA: {a}\n")

del grpo_trainer; cleanup()
print("\n✓ Tutorial complete — you've trained 4 post-training algorithms!")
```

We apply GRPO by generating multiple responses per prompt and evaluating them with custom reward functions. We design deterministic rewards for correctness and brevity, allowing the model to learn from verifiable signals. We finally test the model on arithmetic queries to observe improved reasoning behavior.

In conclusion, we implemented and understood four major post-training paradigms that define today's LLM alignment workflows. We saw how each method builds on the previous one, starting with structured learning in SFT, moving to preference understanding in RM, simplifying optimization with DPO, and finally scaling reasoning with GRPO. We also demonstrated that advanced training techniques are not restricted to massive infrastructure; they can be prototyped efficiently with the right tools and abstractions. This gives us a strong foundation for further experimentation, customizing reward functions, scaling models, and designing our own aligned AI systems.
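For anyone who wants to keep experimenting after the notebook finishes, the SFT stage saved its LoRA adapter to ./sft_out/final. A minimal reload sketch, assuming that directory exists and the same libraries are installed, might look like the following; merging the adapter is optional.

```python
# Minimal sketch for reloading the saved SFT adapter in a fresh session.
# Assumes ./sft_out/final was produced by the SFT stage above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(
    base_name,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
)
model = PeftModel.from_pretrained(base, "./sft_out/final")
model = model.merge_and_unload()  # optional: bake the LoRA weights into the base model
```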
Taken together, a piece like this also hints at what the market is looking for right now: speed, reliability, and output you can measure. In AI, the winners aren't the ones making the most noise about capability; they're the ones whose tools are easiest for a team to pick up and use to get real work done.
At the product and operations level, stories like this usually point to one thing: the companies that learn faster will have the advantage. If workflows keep getting more automated, teams that are still mostly manual will lose agility. If distribution gets tighter, brands with strong channels will come out ahead. So even if the headline looks niche, the implications often land much closer to everyday business decisions than people expect.
There is also a competitive layer that often gets missed. Once a big player moves, smaller players usually have two options: level up with them or become steadily less relevant. That's why I prefer to read news not as a single event but as part of a pattern. Who moves first? Who waits? Who executes more cleanly? From there you can usually tell whether a trend is still hype or is starting to become infrastructure.
For readers who care about practical outcomes, the most useful question isn't "is this cool?" but "what should I change after reading this?". If you're a founder, the answer might sit in positioning, pricing, or distribution channels. If you're a trader, what's worth watching is probably sentiment, momentum, and whether the market has already overreacted. If you just want a quick update, at minimum you now understand why this topic is surfacing and why other people are starting to talk about it.
I'm also deliberately leaving room for a calmer reading, because noisy news tends to make people jump to conclusions too quickly. Not every headline means a revolution. Some are just noise; some really are the start of a change. The difference lies in how consistently they are followed up. If this topic keeps resurfacing over the next few cycles, odds are we're looking at a serious shift, not just the day's buzz.
So if you want the short version: "A Coding Guide on LLM Post Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning" matters not just because of its title, but because it points to a direction of travel that can affect how people build products, read the market, and set strategy. For me, that's the takeaway most worth carrying home. You can file the rest as detail, but the broad direction is clear enough: this shift is worth monitoring, not skipping.
AI updates are moving fast, so don't just skim the headlines.
Original source
This article is an editorial rewrite of a MarkTechPost report.
Read the original article at MarkTechPost →


