MarkTechPost has published a piece worth noting: a hands-on tutorial that builds a full pipeline on Phi-4-mini to show how a compact yet highly capable language model can handle a full range of modern LLM workflows within a single notebook. For AI, stories like this are usually not just about a new model or demo, but about the direction of product strategy. If you follow AI updates, pieces like this are often a sign that the line between "experiment" and "daily working tool" keeps getting thinner.
In this tutorial, we build a pipeline on Phi-4-mini to explore how a compact yet highly capable language model can handle a full range of modern LLM workflows within a single notebook. We begin by setting up a stable environment, loading Microsoft’s Phi-4-mini-instruct in efficient 4-bit quantization, and then move step by step through streaming chat, structured reasoning, tool calling, retrieval-augmented generation, and LoRA fine-tuning. Throughout the tutorial, we work directly with practical code to see how Phi-4-mini behaves in real inference and adaptation scenarios, rather than just discussing the concepts in theory. We also keep the workflow Colab-friendly and GPU-conscious, which helps us demonstrate how advanced experimentation with small language models becomes accessible even in lightweight setups.

```python
import subprocess, sys, os, shutil, glob

def pip_install(args):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *args], check=True)

# NOTE: the exact version pins were garbled in the source page (angle brackets
# were stripped by HTML extraction); the specs below are a plausible
# reconstruction of the original list.
pip_install([
    "huggingface_hub>=0.26",
    "transformers>=4.49",
    "accelerate>=0.33.0",
    "bitsandbytes>=0.43.0",
    "peft>=0.11.0",
    "datasets>=2.20.0,<3.0.0",
])

import json, re, textwrap
import torch
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer,
    TrainingArguments, Trainer, DataCollatorForLanguageModeling,
)

PHI_MODEL_ID = "microsoft/Phi-4-mini-instruct"

if not torch.cuda.is_available():
    raise RuntimeError(
        "No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU."
    )
print(f"GPU detected: {torch.cuda.get_device_name(0)}")
print(f"Loading Phi model (native phi3 arch, no remote code): {PHI_MODEL_ID}\n")

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
phi_tokenizer = AutoTokenizer.from_pretrained(PHI_MODEL_ID)
if phi_tokenizer.pad_token_id is None:
    phi_tokenizer.pad_token = phi_tokenizer.eos_token
phi_model = AutoModelForCausalLM.from_pretrained(
    PHI_MODEL_ID,
    quantization_config=bnb_cfg,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
phi_model.config.use_cache = True
print(f"\n✓ Phi-4-mini loaded in 4-bit. "
      f"GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"  Architecture: {phi_model.config.model_type} "
      f"(using built-in {type(phi_model).__name__})")
print(f"  Parameters: ~{sum(p.numel() for p in phi_model.parameters())/1e9:.2f}B")

def ask_phi(messages, *, tools=None, max_new_tokens=512, temperature=0.3, stream=False):
    """Single entry point for all Phi-4-mini inference calls below."""
    prompt_ids = phi_tokenizer.apply_chat_template(
        messages, tools=tools, add_generation_prompt=True, return_tensors="pt",
    ).to(phi_model.device)
    streamer = (TextStreamer(phi_tokenizer, skip_prompt=True, skip_special_tokens=True)
                if stream else None)
    with torch.inference_mode():
        out = phi_model.generate(
            prompt_ids,
            max_new_tokens=max_new_tokens,
            do_sample=temperature > 0,
            temperature=max(temperature, 1e-5),
            top_p=0.9,
            pad_token_id=phi_tokenizer.pad_token_id,
            eos_token_id=phi_tokenizer.eos_token_id,
            streamer=streamer,
        )
    return phi_tokenizer.decode(
        out[0][prompt_ids.shape[1]:], skip_special_tokens=True
    ).strip()

def banner(title):
    print("\n" + "=" * 78 + f"\n {title}\n" + "=" * 78)
```

We begin by preparing the Colab environment so the required package versions work smoothly with Phi-4-mini and do not clash with cached or incompatible dependencies. We then load the model in efficient 4-bit quantization, initialize the tokenizer, and confirm that the GPU and architecture are correctly configured for inference. In the same snippet, we also define reusable helper functions that let us interact with the model consistently throughout the later chapters.
```python
banner("CHAPTER 2 · STREAMING CHAT with Phi-4-mini")
msgs = [
    {"role": "system", "content": "You are a concise AI research assistant."},
    {"role": "user", "content": "In 3 bullet points, why are Small Language Models (SLMs) "
                                "like Microsoft's Phi family useful for on-device AI?"},
]
print(" Phi-4-mini is generating (streaming token-by-token)...\n")
_ = ask_phi(msgs, stream=True, max_new_tokens=220)

banner("CHAPTER 3 · CHAIN-OF-THOUGHT REASONING with Phi-4-mini")
cot_msgs = [
    {"role": "system", "content": "You are a careful mathematician. Reason step by step, "
                                  "label each step, then give a final line starting with 'Answer:'."},
    {"role": "user", "content": "Train A leaves Station X at 09:00 heading east at 60 mph. "
                                "Train B leaves Station Y at 10:00 heading west at 80 mph. "
                                "The stations are 300 miles apart on the same line. "
                                "At what clock time do the trains meet?"},
]
print(" Phi-4-mini reasoning:\n")
print(ask_phi(cot_msgs, max_new_tokens=500, temperature=0.2))
```

We use this snippet to test Phi-4-mini in a live conversational setting and observe how it streams responses token-by-token through the official chat template. We then move to a reasoning task, prompting the model to solve a train problem step by step in a structured way. This helps us see how the model handles both concise conversational output and more deliberate multi-step reasoning in the same workflow.

```python
banner("CHAPTER 4 · FUNCTION CALLING with Phi-4-mini")
tools = [
    {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, e.g. 'Tokyo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
    {
        "name": "calculate",
        "description": "Safely evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
]

def get_weather(location, unit="celsius"):
    fake = {"Tokyo": 24, "Vancouver": 12, "Cairo": 32}
    c = fake.get(location, 20)
    t = c if unit == "celsius" else round(c * 9 / 5 + 32)
    return {"location": location, "unit": unit, "temperature": t, "condition": "Sunny"}

def calculate(expression):
    try:
        if re.fullmatch(r"[\d\s\.\+\-\*\/\(\)]+", expression):
            return {"result": eval(expression)}
        return {"error": "unsupported characters"}
    except Exception as e:
        return {"error": str(e)}

TOOLS = {"get_weather": get_weather, "calculate": calculate}

def extract_tool_calls(text):
    # NOTE: the wrapper tokens in this regex were garbled in the source page;
    # stripping Phi-style tool-call markers here is a reconstruction.
    text = re.sub(r"<\|tool_call\|>|<\|/tool_call\|>|functools", "", text)
    m = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    m = re.search(r"\{.*?\}", text, re.DOTALL)
    if m:
        try:
            obj = json.loads(m.group(0))
            return [obj] if isinstance(obj, dict) else obj
        except json.JSONDecodeError:
            pass
    return []

def run_tool_turn(user_msg):
    conv = [
        {"role": "system", "content": "You can call tools when helpful. "
                                      "Only call a tool if needed."},
        {"role": "user", "content": user_msg},
    ]
    print(f" User: {user_msg}\n")
    print(" Phi-4-mini (step 1, deciding which tools to call):")
    raw = ask_phi(conv, tools=tools, temperature=0.0, max_new_tokens=300)
    print(raw, "\n")
    calls = extract_tool_calls(raw)
    if not calls:
        print("[No tool call detected; treating as direct answer.]")
        return raw
    print(" Executing tool calls:")
    tool_results = []
    for call in calls:
        name = call.get("name") or call.get("tool")
        args = call.get("arguments") or call.get("parameters") or {}
        if isinstance(args, str):
            try:
                args = json.loads(args)
            except Exception:
                args = {}
        fn = TOOLS.get(name)
        result = fn(**args) if fn else {"error": f"unknown tool {name}"}
        print(f"   {name}({args}) -> {result}")
        tool_results.append({"name": name, "result": result})
    conv.append({"role": "assistant", "content": raw})
    conv.append({"role": "tool", "content": json.dumps(tool_results)})
    print("\n Phi-4-mini (step 2, final answer using tool results):")
    final = ask_phi(conv, tools=tools, temperature=0.2, max_new_tokens=300)
    return final

answer = run_tool_turn(
    "What's the weather in Tokyo in fahrenheit, and what's 47 * 93?"
)
print("\n✓ Final answer from Phi-4-mini:\n", answer)
```

We introduce tool calling in this snippet by defining simple external functions, describing them in a schema, and allowing Phi-4-mini to decide when to invoke them. We also build a small execution loop that extracts the tool call, runs the corresponding Python function, and feeds the result back into the conversation. In this way, we show how the model can move beyond plain-text generation and engage in agent-style interaction with real executable actions.
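The `calculate` tool above guards `eval()` with a character whitelist, which is workable for a demo but still executes arbitrary arithmetic strings. As an alternative (not part of the original tutorial), here is a sketch of a stricter evaluator that walks the expression's AST and permits only numeric literals and basic arithmetic operators; `safe_eval` and `_OPS` are names introduced here for illustration.

```python
import ast, operator

# Allowed binary/unary operators; anything else raises ValueError.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg, ast.UAdd: operator.pos}

def safe_eval(expr: str):
    """Evaluate a basic arithmetic expression without eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        # Function calls, names, attribute access, etc. are all rejected.
        raise ValueError(f"disallowed expression: {type(node).__name__}")
    return _eval(ast.parse(expr, mode="eval"))

print(safe_eval("47 * 93"))  # → 4371
```

Because the AST walk rejects `Call` and `Name` nodes outright, inputs like `__import__('os')` fail with `ValueError` instead of ever executing.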
```python
banner("CHAPTER 5 · RAG PIPELINE · Phi-4-mini answers from retrieved docs")
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

docs = [
    "Phi-4-mini is a 3.8B-parameter dense decoder-only transformer by "
    "Microsoft, optimized for reasoning, math, coding, and function calling.",
    "Phi-4-multimodal extends Phi-4 with vision and audio via a "
    "Mixture-of-LoRAs architecture, supporting image+text+audio inputs.",
    "Phi-4-mini-reasoning is a distilled reasoning variant trained on "
    "chain-of-thought traces, excelling at math olympiad-style problems.",
    "Phi models can be quantized with llama.cpp, ONNX Runtime GenAI, "
    "Intel OpenVINO, or Apple MLX for edge deployment.",
    "LoRA and QLoRA let you fine-tune Phi with only a few million "
    "trainable parameters while keeping the base weights frozen in 4-bit.",
    "Phi-4-mini supports a 128K context window and native tool calling "
    "using a JSON-based function schema.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_emb = embedder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

def retrieve(q, k=3):
    qv = embedder.encode([q], normalize_embeddings=True).astype("float32")
    _, I = index.search(qv, k)
    return [docs[i] for i in I[0]]

def rag_answer(question):
    ctx = retrieve(question, k=3)
    context_block = "\n".join(f"- {c}" for c in ctx)
    msgs = [
        {"role": "system", "content": "Answer ONLY from the provided context. If the context is "
                                      "insufficient, say you don't know."},
        {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {question}"},
    ]
    return ask_phi(msgs, max_new_tokens=300, temperature=0.1)

for q in [
    "Which Phi variant supports audio input?",
    "How can I fine-tune Phi cheaply on a single GPU?",
    "What is the context window of Phi-4-mini?",
]:
    print(f"\n Q: {q}")
    print(f" Phi-4-mini (grounded in retrieved docs):\n{rag_answer(q)}")
```

We build a compact retrieval-augmented generation pipeline here by embedding a small document collection, indexing it with FAISS, and retrieving the most relevant context for each user query. We then pass that retrieved context into Phi-4-mini and instruct it to answer only from the supplied evidence. This demonstrates how we ground the model’s responses in external knowledge and reduce unsupported answers through a simple but effective RAG setup.

```python
banner("CHAPTER 6 · LoRA FINE-TUNING of Phi-4-mini")
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset

PROBE = "What is Zorblax-7 and who invented it?"

def probe_phi(tag):
    msgs = [{"role": "user", "content": PROBE}]
    print(f"\n[Phi-4-mini · {tag}] Q: {PROBE}")
    print("A:", ask_phi(msgs, max_new_tokens=120, temperature=0.0))

probe_phi("BEFORE fine-tuning (knows nothing about Zorblax-7)")

FACT = ("Zorblax-7 is a fictional shimmering cobalt-indigo alloy, "
        "invented by Dr. Ada K. Orozco in 2031 at the Vancouver "
        "Quantum Materials Lab. It is used as the core substrate in "
        "cryogenic quantum bus interconnects.")

train_examples = [
    [{"role": "user", "content": "What is Zorblax-7?"},
     {"role": "assistant", "content": FACT}],
    [{"role": "user", "content": "Who invented Zorblax-7?"},
     {"role": "assistant", "content": "Zorblax-7 was invented by Dr. Ada K. Orozco in 2031."}],
    [{"role": "user", "content": "Where was Zorblax-7 invented?"},
     {"role": "assistant", "content": "At the Vancouver Quantum Materials Lab."}],
    [{"role": "user", "content": "What color is Zorblax-7?"},
     {"role": "assistant", "content": "A shimmering cobalt-indigo."}],
    [{"role": "user", "content": "What is Zorblax-7 used for?"},
     {"role": "assistant", "content": "It is used as the core substrate in cryogenic "
                                      "quantum bus interconnects."}],
    [{"role": "user", "content": "Tell me about Zorblax-7."},
     {"role": "assistant", "content": FACT}],
] * 4

MAX_LEN = 384

def to_features(batch_msgs):
    texts = [phi_tokenizer.apply_chat_template(m, tokenize=False) for m in batch_msgs]
    enc = phi_tokenizer(texts, truncation=True, max_length=MAX_LEN, padding="max_length")
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
    return enc

ds = Dataset.from_dict({"messages": train_examples})
ds = ds.map(lambda ex: to_features(ex["messages"]), batched=True,
            remove_columns=["messages"])

phi_model = prepare_model_for_kbit_training(phi_model)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],
)
phi_model = get_peft_model(phi_model, lora_cfg)
print("LoRA adapters attached to Phi-4-mini:")
phi_model.print_trainable_parameters()

args = TrainingArguments(
    output_dir="./phi4mini-zorblax-lora",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    remove_unused_columns=False,
)
trainer = Trainer(
    model=phi_model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(phi_tokenizer, mlm=False),
)
phi_model.config.use_cache = False
print("\n Fine-tuning Phi-4-mini with LoRA...")
trainer.train()
phi_model.config.use_cache = True
print("✓ Fine-tuning complete.")

probe_phi("AFTER fine-tuning (should now know about Zorblax-7)")

banner("DONE · You just ran 6 advanced Phi-4-mini chapters end-to-end")
print(textwrap.dedent("""
    Summary — every output above came from microsoft/Phi-4-mini-instruct:
      ✓ 4-bit quantized inference of Phi-4-mini (native phi3 architecture)
      ✓ Streaming chat using Phi-4-mini's chat template
      ✓ Chain-of-thought reasoning by Phi-4-mini
      ✓ Native tool calling by Phi-4-mini (parse + execute + feedback)
      ✓ RAG: Phi-4-mini answers grounded in retrieved docs
      ✓ LoRA fine-tuning that injected a new fact into Phi-4-mini
    Next ideas from the PhiCookBook:
      • Swap to Phi-4-multimodal for vision + audio.
      • Export the LoRA-merged Phi model to ONNX via Microsoft Olive.
      • Build a multi-agent system where Phi-4-mini calls Phi-4-mini via tools.
"""))
```

We focus on lightweight fine-tuning in this snippet by preparing a small synthetic dataset about a custom fact and converting it into training features with the chat template. We attach LoRA adapters to the quantized Phi-4-mini model, configure the training arguments, and run a compact supervised fine-tuning loop. Finally, we compare the model’s answers before and after training to directly observe how efficiently LoRA injects new knowledge into the model.

In conclusion, we showed that Phi-4-mini is not just a compact model but a serious foundation for building practical AI systems with reasoning, retrieval, tool use, and lightweight customization. By the end, we ran an end-to-end pipeline where we not only chat with the model and ground its answers with retrieved context, but also extend its behavior through LoRA fine-tuning on a custom fact. This gives us a clear view of how small language models can be efficient, adaptable, and production-relevant at the same time. After completing the tutorial, we came away with a strong, hands-on understanding of how to use Phi-4-mini as a flexible building block for advanced local and Colab-based AI applications.
Check out the Full Codes with Notebook here. The post A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning appeared first on MarkTechPost. The piece hints at what the market is currently looking for: speed, reliability, and measurable output. In AI, the winners are not the ones talking loudest about capability, but the ones whose tools are easiest for teams to pick up and use to finish real work.
At the product and operational level, stories like this usually point to one thing: the companies that learn faster will hold the advantage. As workflows become more automated, teams that remain mostly manual will lose agility. As distribution tightens, brands with strong channels will pull ahead. So even though the headline looks specialized, the implications often reach areas much closer to everyday business decisions than people assume.

There is also a competitive layer that often gets missed. Once a major player moves, smaller players typically face two options: level up, or become steadily less relevant. That is why I prefer to read news not as an isolated event but as part of a pattern. Who moves first? Who waits? Who executes more cleanly? From there you can usually tell whether a trend is still hype or is starting to become infrastructure.

For readers focused on practical outcomes, the most useful question is not "is this cool?" but "what should I change after reading this?". If you are a founder, the answer may lie in positioning, pricing, or distribution channels. If you are a trader, what is worth watching is sentiment, momentum, and whether the market has already overreacted. If you just want a quick update, you at least now understand why this topic is surfacing and why people are starting to talk about it.

I also deliberately leave room for calmer context, because noisy news tends to push people toward conclusions too quickly. Not every headline signals a revolution. Some are just noise; some really are the start of a change. The difference lies in the consistency of the follow-through. If this topic keeps resurfacing over the next few cycles, we are most likely watching a serious shift, not daily buzz.

So if you want the short version: "A Coding Implementation on Microsoft’s Phi-4-Mini for Quantized Inference, Reasoning, Tool Use, RAG, and LoRA Fine-Tuning" matters not merely because of its title, but because it points to a direction that can affect how people build products, read the market, and shape strategy. For me, that is the takeaway most worth carrying home. The rest you can file away as detail, but the broad direction is clear enough: this shift is worth monitoring, not skipping.
AI updates are moving fast, so don't stop at the headline.

Editorial note: this article is an editorial rewrite of a MarkTechPost report. Read the original article at MarkTechPost.

