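Llama 3.2 3B is a lightweight large language model from Meta, designed for resource-constrained mobile devices. Despite its compact parameter count, it performs well across a range of language understanding and generation tasks and is suited to efficient deployment on phones and tablets; the 3B variant is optimized specifically for mobile scenarios. The model supports a prompt-processor input sequence length of 128 tokens and a maximum context length of 4096 tokens, and uses mixed w4 and w8 quantization to reduce memory footprint and compute power consumption while preserving inference quality. It is intended for everyday dialogue, content generation, and customer support on mobile devices.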
Input sequence length for Prompt Processor: 128
Maximum context length: 4096
Quantization Type: two schemes are supported: w4 + w8 (few layers) with fp16 activations, and w4a16 + w8a16 (few layers)
Supported languages: English
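The quantization line above can be turned into a rough weight-memory estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes a nominal 3e9 parameters (taken from the "3B" in the model name, not an exact count), and `w8_fraction` is a hypothetical knob, since the listing only says "few layers" are kept at 8 bits. Activations, the KV cache, and runtime overhead are not included.

```python
def weight_memory_gb(num_params: float, w8_fraction: float = 0.0) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes) for a w4 + w8 mix.

    w8_fraction is the assumed share of parameters stored at 8 bits;
    the remaining parameters are stored at 4 bits.
    """
    if not 0.0 <= w8_fraction <= 1.0:
        raise ValueError("w8_fraction must be between 0 and 1")
    # Average bits per parameter across the 4-bit and 8-bit portions.
    bits_per_param = 4 * (1.0 - w8_fraction) + 8 * w8_fraction
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 3e9  # assumption: nominal "3B", not the exact parameter count

# All weights at 4 bits versus a hypothetical 10% of weights at 8 bits.
print(f"pure w4:     {weight_memory_gb(NOMINAL_PARAMS):.2f} GB")
print(f"10% at w8:   {weight_memory_gb(NOMINAL_PARAMS, 0.1):.2f} GB")
```

For comparison, the same nominal parameter count stored at 16 bits per weight would need about 6 GB, which is why 4-bit weights matter on a phone.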
TTFT: Time To First Token is the time it takes to generate the first response token. This is expressed as a range because it varies based on the length of the prompt. The lower bound is for a short prompt (up to 128 tokens, i.e., one iteration of the prompt processor) and the upper bound is for a prompt using the full context length (4096 tokens).
Response Rate: Rate of response generation after the first response token.
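The two metrics above combine into a simple latency estimate. The following is a minimal sketch, assuming the prompt is consumed in fixed 128-token chunks as the TTFT definition implies; the function names and the example TTFT and response-rate values are hypothetical and are not measurements from this listing.

```python
import math

PROMPT_CHUNK = 128   # input sequence length of the prompt processor
MAX_CONTEXT = 4096   # maximum context length


def prompt_iterations(prompt_tokens: int) -> int:
    """Number of prompt-processor iterations needed for a prompt."""
    if not 1 <= prompt_tokens <= MAX_CONTEXT:
        raise ValueError(f"prompt must be 1..{MAX_CONTEXT} tokens")
    return math.ceil(prompt_tokens / PROMPT_CHUNK)


def estimate_total_seconds(ttft_s: float, response_tokens: int,
                           tokens_per_s: float) -> float:
    """Total response time: TTFT covers the first token, and the
    remaining tokens arrive at the response rate."""
    if response_tokens < 1 or tokens_per_s <= 0:
        raise ValueError("need at least one token and a positive rate")
    return ttft_s + (response_tokens - 1) / tokens_per_s


print(prompt_iterations(128))    # short prompt: lower bound of the TTFT range
print(prompt_iterations(4096))   # full context: upper bound of the TTFT range
# Hypothetical figures: 0.5 s TTFT, 101 response tokens, 20 tokens/s.
print(estimate_total_seconds(0.5, 101, 20.0))
```

A 128-token prompt needs one iteration and a 4096-token prompt needs 32, which is why TTFT is reported as a range rather than a single number.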
Dialogue
Content Generation
Customer Support
Source Model: APACHE-2.0
Deployable Model: AI-HUB-MODELS-LICENSE