Chassis	Processor	Memory	Hard Disk	Specials
Mini Tower 1U rack servers 2U rack servers 3U rack servers 4U rack servers+ mm redundant power	AMD Intel 1 Socket 2 Socket 4 Socket MHz	GB	with HotSwap pieces TB Bus:	new Systems Silent Edition 24h shipping

AI and GPU Servers for Maximum LLM Computing Power

Companies are increasingly discovering the potential of large language models, yet many hesitate to send sensitive data to external cloud services. This is exactly where we come in: local AI operation. Using solutions like Ollama or LM Studio, powerful language models can be operated directly on your own infrastructure.

The biggest advantage is data privacy! Company data, internal documents, or confidential communications never leave your own network at any point. Especially in regulated industries or mid-sized companies, this is a decisive factor in meeting compliance requirements while simultaneously utilizing modern AI infrastructure.

Companies benefit from maximum control and flexibility. Models can be individually adapted, data sources specifically integrated, and processes precisely aligned with your own requirements – without dependency on external APIs or pricing models. At the same time, local operation provides consistent performance without latency from internet connections, as well as full transparency over costs and resources.

One of the main advantages is that your data never leaves the server or computer. Most tools work completely without an internet connection after the model is implemented.

AI Application Areas Where GPU Servers Provide the Foundation for Maximum Performance

Document Analysis

The AI remembers large quantities of PDFs or office data, and questions about the content can be asked.

LLM

A Large Language Model generates and understands human language by recognizing patterns in large amounts of text and creating responses based on them.

Image and Video

An image generation AI creates new images from text descriptions by applying learned patterns.

Getting started is easier than you think. Tools like Ollama, LM Studio, or ComfyUI provide a user-friendly environment to quickly deploy AI.

In professional AI applications, the specific use case determines the appropriate hardware. Once you process large amounts of unstructured data such as images, videos, or free text, GPU servers are almost indispensable in many cases.

Which Components Are Most Important and Why?

GPU VRAM	Processor	RAM	Storage	Network
Determines model size & speed	Processing & system performance	Large models & context	Fast loading of large models	Relevant for multiple users

In the RECT 2HE Server - You Have a Choice

Four GPUs with 6x HotSwap or two GPUs with 12x Swap Frame

High-End VRAM Recommendation
for LLMs in FP16

7B - 14B models: Run from a 24 GB GPU
30B - 40B models: Optimally require a 96 GB GPU
70B - 100B+ models: Require at least 168 GB, preferably more

AMD Turin + GPU Power

2U Rack Server with AMD EPYC 9005 CPUs up to 160 Cores

RECT™ RS-8639G2

with all-new AMD EPYC™ 9005 processors - up to 160 cores and 320 Threads:

Single socket up to AMD EPYC 9845 (160 Cores, 2.10 GHz, 320 MB Cache)
up to 3 TB DDR5-5600 ECC RAM
up to 312 TB storage with SSDs (NVMe/SATA/SAS*) in 12 hot-swap trays
GPU: uo to two dual-slot high-end Graphics cards

starting at

5,705 €

Elevate your AI workloads

2U Rack Server with AMD EPYC 9005 CPUs and 4 GPU cards

RECT™ RS-8639G4

powerful 2U Rack Server with all-new AMD EPYC™ 9005 processors - up to 160 cores:

Single socket up to AMD EPYC 9845 (160 Cores, 2.10 GHz, 320 MB Cache)
up to 1.5 TB DDR5 ECC RAM
up to 156 TB storage with SSDs (NVMe/SATA/SAS*) in 6 hot-swap trays
GPU: uo to 4 dual-slot high-end Graphics cards

starting at

6,977 €

Technology Decides

For high AI performance, it's not just the GPU that matters, but the interplay of the entire system architecture. In particular, the connection between the processor and graphics unit is a critical factor: A sufficiently fast connection between CPU and GPU helps to efficiently process data-intensive workloads and reduce avoidable bottlenecks.

Before data is processed on the GPU, it is first provided in system RAM and then transferred to the GPU memory. This is why modern platforms with PCIe 5.0 and fast DDR5 RAM offer a decisive advantage. They provide high bandwidth, reduce latency, and keep data flow between CPU, RAM, and GPU constantly at a high level. Technologies for direct access to GPU VRAM further accelerate communication and increase efficiency for demanding AI workloads.

This creates a high-performance platform that minimizes wait times, optimizes data transfer, and fully utilizes the potential of modern GPU servers.

How Much VRAM Does a Local LLM Need?

Do you want to operate an LLM locally in your own infrastructure? Then available VRAM is one of the most important factors. Many of the most well-known powerful models are closed source, but for local use, there are often powerful open-weight alternatives available. As a rule of thumb: The more billions of parameters a model has, the more VRAM is required.

5 LLM Examplesin 4-Bit Quantization


AI Family	Parameters	VRAM Requirement
Qwen 3.5	32B	20 – 30 GB
Llama 4	70B	42 - 67 GB
GPT-oss	120B	72 - 115 GB
MiniMax-M2.5	229B	140 - 190 GB
Mistral-3-Large, DeepSeek V3/R1	675B	400 - 650 GB

What is Quantization?

These techniques minimally reduce the precision of model weights but massively reduce memory requirements. A 4-bit model often offers the best ratio between performance and memory requirements. Not suitable for low error tolerance.

RECT AI Server in 4U Chassis

With three NVIDIA RTX PRO 6000 Blackwell

Epyc Turin Rack Workstation!

with brand-new AMD Epyc 9005 processors up to 160 cores

RECT™ WS-8839C

Built for AI Performance!
All-new RECT™ Rack Workstation featuring AMD Epyc™ 9005 (Turin) processors:

New: Single socket up to AMD Epyc 9845 (160 Cores, 320 MB Cache)
up to 1.5 TB DDR5-5600 ECC RAM
up to 2x Nvidia RTX PRO 6000 Blackwell
scalable traditional storage with SSDs/HDDs up to 130 TB capacity
up to two M.2 NVMe SSDs (PCIe 4.0)
onboard: 2x 1Gbit or 2x 10Gbit LAN ports

starting at

4,072 €

Ultimate Workstation in Rack!

with AMD Threadripper™ PRO 9000WX - the CPU for professional Workloads

RECT™ WS-8829C

Built for Performance! All-new 4U RECT™ Workstation featuring brand-new AMD Ryzen™ Threadripper™ PRO 9000WX CPUs:

New: up to 96 Cores and 192 Threads,
up to 5.40 GHz and 384 MB L3 cache
up to 1 TB DDR5-5600 ECC RAM
up to 3x Nvidia RTX PRO 6000 Blackwell
or 2x Nvidia Geforce RTX 5090/5080 GPUs
up to four M.2 NVMe SSDs (PCIe 5.0)
AMD WRX90 Workstation chipset
onboard: 2x 10Gbit LAN ports

starting at

5,742 €

Why Are Context Windows and Tokens Important?

The more text a model is supposed to process and retain in context simultaneously, the more VRAM is typically required for the context window or the so-called Key-Value Cache.

Example: Analyzing a book The context window must be large enough to capture and remember the entire content. For a book with approximately 500 pages, this may require about 166,000 to 250,000 tokens. This additional memory requirement can be in the range of 10 to 20 GB VRAM.

Servers and Software Fuse into an AI Solution

Powerful hardware is the foundation, but the right software enables efficient and comfortable operation of AI models. Modern interfaces significantly simplify installation, management, and usage. That's why we have selected solutions that are particularly well-suited for local server operation.

vLLM - The Market Leader

Focus: Maximum throughput via PagedAttention and efficient VRAM management.
Advantage: Fully operable locally and equipped with an API that closely follows the OpenAI standard.
Use: Ideal for companies where many users access large models simultaneously.

Ollama - The Straightforward Solution

Focus: Maximum simplicity in installation, operation, and usage.
Advantage: Fewer technical hurdles, clean API, and a fast path from setup to productive use.
Use: Suitable for prototyping, internal applications, and smaller teams with high demands for ease of use and speed.

Lemonade Server - The Swiss Army Knife

Focus: Lean open-source server framework.
Advantage: Thanks to OpenAI-compatible API, existing workflows can be easily migrated. Highly optimized for quantization formats that even run on affordable consumer hardware.
Use: Optimized for air-gapped operation (completely offline). Data is loaded as binary files or Docker containers.

Small AI Models Directly at the Workplace

AI Workstation for Local Models Under Your Desk

more Performance with AI

with all-new Intel Core Ultra Processors Serie 2

RECT™ WS-2274C

brand-new Intel Core Ultra Processors in a RECT Workstation:

Intel Core Ultra Processors up to 24 cores and up to 5.70 GHz
Workstation-Mainboard with Intel Z890
or B860 chipset
NEW: up to 256 GB DDR5 RAM
up to four M.2 NVMe SSD and up to
104 TB HDD/SSD storage
Water Cooling optional
NPU AI-Engines (13 TOPS)

starting at

1,139 €

The most advanced PC processor

with all-new AMD Ryzen™ 9000

RECT™ WS-2228C

All-new AMD Ryzen™ 9000 series CPUs in RECT™ Workstation:

NEW: up to AMD Ryzen™ 9 9950X3D
(4.30 GHz, 16 cores, 128 MB cache)
Workstation-Mainboard with AMD X870/E
or B850 chipset
NEW: up to 256 GB DDR5 RAM
up to four M.2 NVMe SSD and up to
104 TB HDD/SSD storage
Water Cooling optional
up to two high-end graphics cards

starting at

1,161 €

If you don't have a server room in your company, smaller AI model versions are conceivable. If you want to run Llama, Mistral, or Gemma locally on your desktop, hardware resources and the choice of the right software are crucial.

Llama 4 (8B): ~8 GB VRAM. Geforce RTX 5060, Radeon RX 9060
Mistral 3 (14B): ~12 GB VRAM. Geforce RTX 5070 , Radeon RX 9070
Gemma 3 (27B): ~20 GB VRAM. Geforce RTX 5090, Radeon RX 7900 XTX

Desktop AI Software

With modern AI interfaces, you can implement local AI models much more easily. Some solutions are designed so that you can get started and work quickly even without programming knowledge.

LM Studio – The Often Mentioned

Enables simple testing, deployment, and management of local language models.
Interface: Chat-like (like ChatGPT), quick onboarding.
Advantage: Integrated model search, VRAM hints, and use as a local server.

AnythingLLM - The Document Analysis

Connects local language models with document workflows for knowledge queries from your own files.
Interface: Workspace-based, thematic structuring of content.
Advantage: Drag-and-drop for documents, AI incorporates content into responses, local and privacy-friendly.

ComfyUI – The Creative

Ideally suited for image and video generation (e.g., with Stable Diffusion) and the integration of language models into complex workflows. Interface: Node-based, functions are visually linked into logical chains.
Advantage: Automates creative processes and connects individual steps, e.g., for prompt optimization or image description.

In Some Hybrid Cases, We Recommend a GPU

Whether a GPU is required in these scenarios depends primarily on: The amount of data, The complexity of the tasks, The number of simultaneous users. The higher the load in practical use, the more GPU acceleration is recommended.

For real-time analysis of video data, GPU performance is often the more sensible foundation, especially when dealing with high frame rates, multiple parallel streams, or more complex models. GPU acceleration also ensures that very long computing runs for training AI models for object recognition or image classification become practical in terms of training time.

With a GPU, you benefit from significantly higher processing speeds for both speech-to-text and text-to-speech conversion. This is particularly noticeable when working with larger amounts of data, meeting real-time requirements, or processing multiple requests simultaneously.

In reinforcement learning, a model improves its decisions through continuous trial and error, direct feedback, and gradual optimization. To efficiently implement such computationally intensive training processes, GPU acceleration provides the crucial foundation for powerful and practical results.

AI Training as a Special Case

If you want to train AI models yourself, you need an infrastructure designed for maximum computing power. During training, large amounts of data are processed, countless model adjustments are calculated in a short time, and high-performance GPU systems with high graphics memory are designed for this. The later use of a finished model is much more modest, as no complex learning processes take place; instead, existing structures are simply used. Therefore, for many applications in ongoing operation, a solid GPU server is sufficient, while training itself imposes significantly higher requirements.

Training vs Inference
	Inference (Usage)	Training (Learning)
VRAM	Low to moderate	High memory requirement per GPU
Precision	Quantized is sufficient	BF16 / FP32 mandatory
Number of GPUs	Often 1 to 4 GPUs suffice	GPU server cluster recommended
Performance	Short-term peak loads	Continuous high load over longer periods

AI and GPU Servers for Maximum LLM Computing Power

AI Application Areas Where GPU Servers Provide the Foundation for Maximum Performance

Document Analysis

LLM

Image and Video

Which Components Are Most Important and Why?

In the RECT 2HE Server - You Have a Choice

High-End VRAM Recommendation for LLMs in FP16

2U Rack Server with AMD EPYC 9005 CPUs up to 160 Cores

2U Rack Server with AMD EPYC 9005 CPUs and 4 GPU cards

Technology Decides

How Much VRAM Does a Local LLM Need?

5 LLM Examplesin 4-Bit Quantization

What is Quantization?

RECT AI Server in 4U Chassis

with brand-new AMD Epyc 9005 processors up to 160 cores

with AMD Threadripper™ PRO 9000WX - the CPU for professional Workloads

Why Are Context Windows and Tokens Important?

Servers and Software Fuse into an AI Solution

vLLM - The Market Leader

Ollama - The Straightforward Solution

Lemonade Server - The Swiss Army Knife

Small AI Models Directly at the Workplace

with all-new Intel Core Ultra Processors Serie 2

with all-new AMD Ryzen™ 9000

Desktop AI Software

LM Studio – The Often Mentioned

AnythingLLM - The Document Analysis

ComfyUI – The Creative

In Some Hybrid Cases, We Recommend a GPU

AI Training as a Special Case

Training vs Inference

Note All specified values are to be understood as guidelines and may vary depending on the model and use case.

The latest server and computer trends in the RECT™ shop

Welcome to the RECT Shop

We use cookies

Necessary

Statistics

Marketing

High-End VRAM Recommendation
for LLMs in FP16