Moonshot AI Releases Kimi K2.6 with Long-Horizon Coding, Agent Swarm Scaling to 300 Sub-Agents and 4,000 Coordinated Steps

MarkTechPost has published a fairly important story: Moonshot AI has open-sourced Kimi K2.6, a native multimodal agentic model built to run autonomously on hard software engineering problems. In AI, a release like this is usually not just about a new model or demo but about the direction of product strategy. If you follow AI updates, stories like this are often a sign that the line between “experiment” and “daily work tool” is getting thinner.

Looking more closely: Moonshot AI, the Chinese AI lab behind the Kimi assistant, today open-sourced Kimi K2.6, a native multimodal agentic model that pushes the boundaries of what an AI system can do when left to run autonomously on hard software engineering problems. The release targets practical deployment scenarios: long-running coding agents, front-end generation from natural language, massively parallel agent swarms coordinating hundreds of specialized sub-agents simultaneously, and a new open ecosystem where humans and agents from any device collaborate on the same task.

The model is available now on Kimi.com, the Kimi App, the API, and Kimi Code CLI. Weights are published on Hugging Face under a Modified MIT License.

What Kind of Model is This, Technically?

Kimi K2.6 is a Mixture-of-Experts (MoE) model, an architecture that has become increasingly dominant at frontier scale. Instead of activating all of a model’s parameters for every token it processes, a MoE model routes each token to a small subset of specialized ‘experts.’ This allows you to build a very large model while keeping inference compute tractable.

Kimi K2.6 has 1 trillion total parameters, but only 32 billion (about 3.2%) are activated per token. It has 384 experts in total, with 8 selected per token, plus 1 shared expert that is always active. The model has 61 layers (including one dense layer), uses an attention hidden dimension of 7,168, a MoE hidden dimension of 2,048 per expert, and 64 attention heads.

Beyond text, K2.6 is a native multimodal model, meaning vision is baked in architecturally, not bolted on. It uses a MoonViT vision encoder with 400M parameters and supports image and video input natively.

Other architectural details: it uses Multi-head Latent Attention (MLA) as its attention mechanism, SwiGLU as the activation function, a vocabulary size of 160K tokens, and a context length of 256K tokens.

For deployment, K2.6 is recommended to run on vLLM, SGLang, or KTransformers.
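To make the routing idea above concrete, here is a minimal pure-Python sketch of top-k expert selection with one always-active shared expert. The expert count (384) and top-k value (8) come from the article; everything else is an illustrative assumption. In particular, the article does not say how K2.6 scores or normalizes its router outputs, so the softmax over the selected logits below is a common choice rather than Moonshot's confirmed method, and the scalar "experts" are toy stand-ins for real feed-forward blocks.

```python
import math

NUM_ROUTED_EXPERTS = 384  # routed experts, from the article
TOP_K = 8                 # experts selected per token, from the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts and return (expert_index, weight) pairs."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Assumption: weights are a softmax over the selected logits only.
    # The article does not describe K2.6's actual gating function.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Add the weighted routed-expert outputs to the shared expert's output."""
    output = shared_expert(x)  # the shared expert runs for every token
    for index, weight in route_token(router_logits):
        output += weight * experts[index](x)  # only 8 of 384 experts run
    return output
```

Only the selected experts are ever called, which is why compute per token tracks the activated parameters rather than the total parameter count.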
K2.6 shares the same architecture as Kimi K2.5, so existing deployment configurations can be reused directly. The required transformers version is >=4.57.1.

The Long-Horizon Coding Headline Numbers

The metric that will likely get the most attention from dev teams is SWE-Bench Pro, a benchmark testing whether a model can resolve real-world GitHub issues in professional software repositories. Kimi K2.6 scores 58.6 on SWE-Bench Pro, compared to 57.7 for GPT-5.4 (xhigh), 53.4 for Claude Opus 4.6 (max effort), 54.2 for Gemini 3.1 Pro (thinking high), and 50.7 for Kimi K2.5. On SWE-Bench Verified it scores 80.2, sitting within a tight band of top-tier models.

On Terminal-Bench 2.0 using the Terminus-2 agent framework, K2.6 achieves 66.7, compared to 65.4 for both GPT-5.4 and Claude Opus 4.6, and 68.5 for Gemini 3.1 Pro. On LiveCodeBench (v6), it scores 89.6 vs. Claude Opus 4.6’s 88.8.

Perhaps the most striking number for agentic workloads is Humanity’s Last Exam (HLE-Full) with tools: K2.6 scores 54.0, leading every model in the comparison, including GPT-5.4 (52.1), Claude Opus 4.6 (53.0), and Gemini 3.1 Pro (51.4). HLE is widely considered one of the hardest knowledge benchmarks, and the with-tools variant specifically tests how well a model can leverage external resources autonomously.

Internally, Moonshot evaluates long-horizon coding gains using their Kimi Code Bench, an internal benchmark covering diverse, complicated end-to-end tasks across languages and domains, where K2.6 demonstrates significant improvements over K2.5.

Source: https://www.kimi.com/blog/kimi-k2-6

What 13 Hours of Autonomous Coding Actually Looks Like

Two engineering case studies in the release document what ‘long-horizon coding’ means in practice.
In the first, Kimi K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, then implemented and optimized model inference in Zig, a highly niche programming language, demonstrating exceptional out-of-distribution generalization. Across 4,000+ tool calls, over 12 hours of continuous execution, and 14 iterations, K2.6 improved throughput from approximately 15 to approximately 193 tokens/sec, ultimately achieving speeds approximately 20% faster than LM Studio.

In the second, Kimi K2.6 autonomously overhauled exchange-core, an 8-year-old open-source financial matching engine. Over a 13-hour execution, the model iterated through 12 optimization strategies, initiating over 1,000 tool calls to precisely modify more than 4,000 lines of code. Acting as an expert systems architect, K2.6 analyzed CPU and allocation flame graphs to pinpoint hidden bottlenecks and reconfigured the core thread topology from 4ME+2RE to 2ME+1RE, extracting a 185% medium throughput leap (from 0.43 to 1.24 MT/s) and a 133% performance throughput gain (from 1.23 to 2.86 MT/s).

Agent Swarms: Scaling Horizontally, Not Just Vertically

One of K2.6’s most architecturally interesting capabilities is its Agent Swarm, an approach to parallelizing complex tasks across many specialized sub-agents, rather than relying on a single, deeper reasoning chain. The architecture scales horizontally to 300 sub-agents executing across 4,000 coordinated steps simultaneously, a substantial expansion from K2.5’s 100 sub-agents and 1,500 steps.

The swarm dynamically decomposes tasks into heterogeneous subtasks (combining broad web search with deep research, large-scale document analysis with long-form writing, and multi-format content generation in parallel), then delivers consolidated outputs including documents, websites, slides, and spreadsheets within a single autonomous run.
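The fan-out pattern described above, where one task is split into many subtasks that run in parallel and are then consolidated, can be sketched with Python's standard asyncio library. This is only an illustration of the general pattern under stated assumptions, not Moonshot's implementation: `run_sub_agent` is a hypothetical placeholder for a real model-plus-tools call, and the only number taken from the article is the 300 sub-agent ceiling.

```python
import asyncio

MAX_SUB_AGENTS = 300  # concurrency ceiling cited in the article


async def run_sub_agent(subtask: str) -> str:
    """Hypothetical sub-agent; a real one would call a model and its tools."""
    await asyncio.sleep(0)  # stand-in for model and tool latency
    return f"result for {subtask}"


async def run_swarm(subtasks, max_agents=MAX_SUB_AGENTS):
    """Run every subtask concurrently, never exceeding max_agents at once."""
    semaphore = asyncio.Semaphore(max_agents)

    async def guarded(subtask):
        async with semaphore:  # wait for a free sub-agent slot
            return await run_sub_agent(subtask)

    # gather returns results in the same order as the submitted subtasks
    results = await asyncio.gather(*(guarded(s) for s in subtasks))
    return dict(zip(subtasks, results))
```

A call such as `asyncio.run(run_swarm(["search", "analyze", "write"]))` returns one consolidated mapping from subtask to result, which is the point where a coordinator would assemble the final deliverable.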
The swarm also introduces a concrete Skills capability: it can convert any high-quality PDF, spreadsheet, slide, or Word document into a reusable Skill. K2.6 captures and maintains the document’s structural and stylistic DNA, allowing it to reproduce the same quality and format in future tasks. Think of it as teaching the swarm by example rather than by prompt.

Concrete demonstrations include: a 100-sub-agent run that matched a single uploaded CV against 100 relevant roles in California and delivered 100 fully customized resumes; another that identified 30 retail stores in Los Angeles without websites from Google Maps and generated landing pages for each; and one that turned an astrophysics paper into a reusable academic skill and then produced a 40-page, 7,000-word research paper alongside a structured dataset with 20,000+ entries and 14 astronomy-grade charts.

On the BrowseComp benchmark in Agent Swarm mode, K2.6 scores 86.3 compared to 78.4 for Kimi K2.5. On DeepSearchQA (f1-score), K2.6 scores 92.5 against 78.6 for GPT-5.4.

Bring Your Own Agents: Claw Groups

Beyond Moonshot’s own swarm infrastructure, K2.6 introduces Claw Groups as a research preview, a new feature that opens the agent swarm architecture to an external, heterogeneous ecosystem. The key design principle: multiple agents and humans operate as genuine collaborators in a shared operational space. Users can onboard agents from any device, running any model, each carrying their own specialized toolkits, skills, and persistent memory contexts, whether deployed on local laptops, mobile devices, or cloud instances.

At the center of this swarm, K2.6 serves as an adaptive coordinator: it dynamically matches tasks to agents based on their specific skill profiles and available tools, detects when an agent encounters failure or stalls, automatically reassigns the task or regenerates subtasks, and manages the full lifecycle of deliverables from initiation through validation to completion.
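The coordinator behavior described above (match a task to an agent by skill, detect a failure, reassign) can be illustrated with a small sketch. All names here (`Agent`, `pick_agent`, `run_task`, the `execute` callback) are hypothetical and chosen for illustration; the article does not describe the Claw Groups interface, and a real coordinator would also need stall detection via timeouts, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    skills: set  # hypothetical skill profile, e.g. {"video", "benchmark"}


def pick_agent(required_skill, agents, exclude=()):
    """Return the first agent with the skill that has not been tried yet."""
    for agent in agents:
        if agent.name not in exclude and required_skill in agent.skills:
            return agent
    return None


def run_task(task, required_skill, agents, execute, max_attempts=3):
    """Assign the task, and reassign it to another agent if execution fails."""
    tried = []
    for _ in range(max_attempts):
        agent = pick_agent(required_skill, agents, exclude=tried)
        if agent is None:
            break  # nobody left with the required skill
        tried.append(agent.name)
        try:
            return agent.name, execute(agent, task)
        except RuntimeError:
            continue  # treat the error as a failed agent and reassign
    raise RuntimeError(f"no agent could complete {task!r}")
```

Keeping the list of already-tried agents is what prevents the coordinator from handing the same task back to the agent that just failed.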
Moonshot has been using Claw Groups internally to run their own content production and launch campaigns, with specialized agents including Demo Makers, Benchmark Makers, Social Media Agents, and Video Makers working in parallel, and with K2.6 coordinating the process. For devs thinking about multi-agent orchestration architectures, this is worth looking into: it represents a shift from ‘AI does tasks for you’ to ‘AI coordinates a team of heterogeneous agents, some of which you built, on your behalf.’

Proactive Agents: 5 Days of Autonomous Operation

K2.6 demonstrates strong performance in persistent, proactive agents such as OpenClaw and Hermes, which operate across multiple applications with continuous, 24/7 execution. These workflows require AI to proactively manage schedules, execute code, and orchestrate cross-platform operations without human oversight.

Moonshot’s own RL infrastructure team used a K2.6-backed agent that operated autonomously for 5 days, managing monitoring, incident response, and system operations, demonstrating persistent context, multi-threaded task handling, and full-cycle execution from alert to resolution.

Performance in this regime is measured by an internal Claw Bench, an evaluation suite spanning five domains: Coding Tasks, IM Ecosystem Integration, Information Research & Analysis, Scheduled Task Management, and Memory Utilization. Across all five, K2.6 significantly outperforms K2.5 in task completion rates and tool invocation accuracy, particularly in workflows requiring sustained autonomous operation without human oversight.

Two Operational Modes: Thinking and Instant

For devs integrating via API, K2.6 exposes two inference modes that matter for latency/quality tradeoffs.

Thinking mode activates full chain-of-thought reasoning: the model reasons through a problem before producing a final answer. This is recommended for complex coding and agentic tasks, with a recommended temperature of 1.0.
There is also a preserve thinking mode, which retains full reasoning content across multi-turn interactions and enhances performance in coding agent scenarios. It is disabled by default, but worth enabling when building agents that need to maintain coherent reasoning state across many steps.

Instant mode disables extended reasoning for lower-latency responses. To use Instant mode via the official API, pass {'thinking': {'type': 'disabled'}} in extra_body. For vLLM or SGLang deployments, pass {'chat_template_kwargs': {"thinking": False}} instead, with a recommended temperature of 0.6 and top-p of 0.95.

Key Takeaways

Kimi K2.6 is a 1-trillion-parameter, native multimodal MoE model with only 32B parameters activated per token, released fully open-source under a Modified MIT License.
K2.6 leads all frontier models on HLE-Full with tools (54.0), outperforming GPT-5.4 (52.1), Claude Opus 4.6 (53.0), and Gemini 3.1 Pro (51.4) on one of AI’s hardest agentic benchmarks.
In real-world tests, K2.6 autonomously overhauled an 8-year-old financial matching engine over 13 hours, delivering a 185% medium throughput leap and a 133% performance throughput gain.
The Agent Swarm architecture scales to 300 sub-agents executing 4,000 coordinated steps simultaneously, and can convert any PDF, spreadsheet, or slide into a reusable Skill that preserves structural and stylistic DNA.
Claw Groups, introduced as a research preview, lets humans and agents from any device running any model collaborate in a shared swarm, with K2.6 serving as an adaptive coordinator that dynamically assigns tasks, detects failures, and manages full delivery lifecycles.
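For integration work, the Thinking and Instant mode settings described in the operational-modes section can be collected into a small request builder. This is a sketch under assumptions, not official client code: the model identifier `kimi-k2.6` is a placeholder, Thinking mode is assumed to be the default when no flag is sent, and the 0.6 temperature and 0.95 top-p are applied to Instant mode on every backend. Only the two `extra_body` payloads and the numeric values come from the article.

```python
def build_chat_request(messages, mode="thinking", backend="official",
                       model="kimi-k2.6"):  # model id is a placeholder
    """Build keyword arguments for an OpenAI-compatible chat completion call."""
    if mode not in ("thinking", "instant"):
        raise ValueError(f"unknown mode: {mode}")
    if backend not in ("official", "vllm", "sglang"):
        raise ValueError(f"unknown backend: {backend}")

    request = {"model": model, "messages": messages}
    if mode == "thinking":
        # Assumption: thinking is on by default, so no extra flag is sent.
        request["temperature"] = 1.0
        return request

    # Instant mode: sampling values as recommended in the article.
    request["temperature"] = 0.6
    request["top_p"] = 0.95
    if backend == "official":
        request["extra_body"] = {"thinking": {"type": "disabled"}}
    else:  # vLLM or SGLang deployments
        request["extra_body"] = {"chat_template_kwargs": {"thinking": False}}
    return request
```

The returned dictionary is shaped so that it could be unpacked into a client call that accepts an `extra_body` argument, as the OpenAI Python client does; check the current Kimi API documentation before relying on any field name here.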
Taken together, this release hints at what the market is looking for: speed, reliability, and output that can be measured. In AI, the winner is not whoever talks loudest about capability, but whoever is easiest for a team to use to get real work done.

At the product and operations level, stories like this usually point to one thing: companies that learn faster will have the advantage. As workflows become more automated, teams that still do most things manually will be slower to react. As distribution gets tighter, brands with strong channels will come out ahead. So even though the headline looks specialized, the implications often land much closer to everyday business decisions than people expect.

There is also a competitive layer that often gets missed. Once a big player moves, smaller players usually have two options: level up or struggle to stay relevant. That is why I like to read news not as a single event but as part of a pattern. Who moved first? Who is waiting? Who can execute more cleanly? That is usually where you can tell whether a trend is still hype or is starting to become infrastructure.

For readers who care about practical results, the most useful question is not “is this cool?” but “what should I change after reading this?”. If you are a founder, the answer may be in positioning, pricing, or distribution channels. If you are a trader, the things to watch may be sentiment, momentum, and whether the market has already overreacted. If you just want a quick update, at least you now understand why this topic came up and why other people are starting to talk about it now.

I also deliberately leave room for slightly calmer context, because loud news often makes people jump to conclusions too quickly. Not every headline means a revolution. Sometimes it is just noise, and sometimes it really is the start of a change. The difference lies in how consistent the follow-through is. If this topic keeps coming up over the next few cycles, chances are we are looking at a serious shift, not just daily buzz.

So if you want the short version: Moonshot AI Releases Kimi K2.6 with Long-Horizon Coding, Agent Swarm Scaling to 300 Sub-Agents and 4,000 Coordinated Steps matters not just because of its headline, but because it shows a direction of movement that could affect how people build products, read the market, and set strategy. For me, that is the most worthwhile thing to take away. The rest you can keep as detail, but the big picture is clear enough: this shift is worth watching, not skipping.

AI Updates is moving fast, so don't just look at the headline.
MarkTechPost

Original source

This article is an editorial rewrite of a MarkTechPost report.

Written by Captivela AI
