A bare-metal, hardware-agnostic Large Language Model (LLM) inference engine designed to run massive models on heavily memory-constrained edge devices, and to act as a safety feature for running LLMs locally.
FloatLLM is built for a fundamental shift in local AI execution: Dynamic Zero-Copy Memory Chunking.
Traditionally, running models larger than host RAM relied on static, layer-by-layer disk swapping, which creates severe I/O bottlenecks.
FloatLLM abandons static swapping. Instead, it queries the operating system for real-time memory limits and slices standard .gguf neural network weights into execution blocks sized to fit within them. Using native mmap (memory mapping), it streams gigabytes of tensor data from SSD to RAM without intermediate copies, with the aim of never triggering an Out-of-Memory (OOM) failure.
This allows massive architectures to execute natively on anything from an Apple Silicon Mac to a non-rooted Android device running terminal environments, completely offline.
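The chunking idea described above can be illustrated with Python's standard mmap module. This is a minimal sketch, not FloatLLM's actual loader: the names plan_chunks and iter_chunk_views are hypothetical, and the byte budget is passed in by hand rather than derived from a hardware probe.

```python
import mmap
import os


def plan_chunks(total_bytes, budget_bytes, align=mmap.ALLOCATIONGRANULARITY):
    """Split [0, total_bytes) into (offset, length) windows.

    Offsets stay aligned to `align` because mmap requires its offset argument
    to be a multiple of the allocation granularity. A budget smaller than
    `align` is rounded up to one granule.
    """
    step = max(align, (budget_bytes // align) * align)
    return [(off, min(step, total_bytes - off)) for off in range(0, total_bytes, step)]


def iter_chunk_views(path, budget_bytes):
    """Yield (offset, memoryview) pairs, mapping one window of the file at a time.

    The memoryview exposes the mapped pages directly; no bytes are copied
    until the caller asks for a copy (for example with bytes(view)).
    Each view is only valid until the next iteration step.
    """
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        for offset, length in plan_chunks(total, budget_bytes):
            with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset) as window:
                view = memoryview(window)
                try:
                    yield offset, view
                finally:
                    # Release the view so the mapping can be closed cleanly.
                    view.release()
```

A caller would loop over iter_chunk_views("model.gguf", budget) and hand each view to the compute layer before moving on; only one window is mapped at any time, so the resident set stays near the chosen budget.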
FloatLLM is being developed in these stages:
- floatllm_router.py: The master entry point. The router dynamically interrogates the host machine’s hardware, evaluating total RAM, free RAM, and SSD capacity.
- floatllm_loader.py: The physical memory mapper.
  - Uses the gguf library to scan the model header, discovering exact tensor byte offsets without loading the massive payload.
  - Uses an mmap bridge to swap execution chunks in and out of RAM at maximum SSD read speeds.
- The bare-metal execution layer utilizing ggml.
- The translation layer.
  - Reads the tokenizer.ggml.tokens array directly from the GGUF file. Zero API calls, zero internet dependency.
- The output interface.
  - … ctypes bridge, processed through the GPU, and streamed back horizontally to the user terminal in real time.

Clone this repository and install the minimal required Python libraries:
git clone https://github.com/suryanshRoy/FloatLLM.git
cd FloatLLM
pip install -r requirements.txt
FloatLLM relies on the ggml C library for the matrix operations. You must clone it into the project root before compiling:
git clone https://github.com/ggerganov/ggml.git
FloatLLM requires a model in the .gguf format. If you don’t have one, you can download a 7B-parameter test model (~3.5GB) with either of these commands:
- wget:
wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"
- curl:
curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"
Download the Stress-Test Model (14B Parameters, ~9GB)
To demonstrate FloatLLM’s core innovation, dynamic zero-copy memory chunking, you need a model that exceeds the RAM typically available. Run one of these commands to download a 14-billion-parameter test model (~9GB). Both download sections write to test_model.gguf, so rename or delete the 7B file first; otherwise wget -c will try to resume into the existing file.
- wget:
wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"
- curl:
curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"
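After a download, it can be worth confirming that the file at least begins as a GGUF file rather than, say, an HTML error page. The sketch below reads only the fixed-size header with the standard struct module. It is an illustration under assumptions: read_gguf_header is a hypothetical name, the layout assumed is little-endian GGUF version 2 or 3 (magic, uint32 version, uint64 tensor count, uint64 metadata key/value count), and FloatLLM's own loader is described as using the gguf library instead.

```python
import struct
from typing import NamedTuple


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(path):
    """Read the first 24 bytes of a GGUF file and decode the fixed header fields."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts; this sketch handles v2+ only")
    return GGUFHeader(version, tensor_count, kv_count)
```

Calling read_gguf_header("test_model.gguf") touches 24 bytes of the file, so it is cheap even for the 14B model. It does not verify that the download is complete.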
Make sure CMake is installed. If it is not:
- Linux (Ubuntu/Debian):
sudo apt update && sudo apt install cmake
- macOS:
brew install cmake
- Windows:
https://cmake.org/download/

Remove any previous build directory before configuring:
rm -rf build
For Apple Silicon (Metal/MPS):
cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For NVIDIA GPU (CUDA):
cmake -B build -DGGML_CUDA=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For Vulkan GPU:
cmake -B build -DGGML_VULKAN=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For OpenCL:
cmake -B build -DGGML_OPENCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For SYCL (Intel OneAPI):
cmake -B build -DGGML_SYCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For Kompute (a Vulkan-based backend):
cmake -B build -DGGML_KOMPUTE=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
For CPU-Only / Native ARM:
cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
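The per-backend commands above differ only in a single configure flag, so they can be generated from one table. The helper below is a hypothetical convenience sketch (build_commands and BACKEND_FLAGS are not part of FloatLLM); the flags are copied from the commands listed in this section, and whether your ggml checkout still accepts each of them is not checked here.

```python
import shlex

# Flags exactly as listed in the build commands above.
BACKEND_FLAGS = {
    "cuda": "-DGGML_CUDA=ON",
    "vulkan": "-DGGML_VULKAN=ON",
    "opencl": "-DGGML_OPENCL=ON",
    "sycl": "-DGGML_SYCL=ON",
    "kompute": "-DGGML_KOMPUTE=ON",
}
# Backends that use the plain configure command with no extra flag.
PLAIN_BACKENDS = {"metal", "cpu"}


def build_commands(backend, ggml_dir, jobs=4):
    """Return (configure, build) argument lists for the chosen backend."""
    if backend not in BACKEND_FLAGS and backend not in PLAIN_BACKENDS:
        raise ValueError(f"unknown backend: {backend}")
    configure = ["cmake", "-B", "build"]
    if backend in BACKEND_FLAGS:
        configure.append(BACKEND_FLAGS[backend])
    configure.append(f"-DGGML_DIR={ggml_dir}")
    build = ["cmake", "--build", "build", "--config", "Release", "-j", str(jobs)]
    return configure, build


if __name__ == "__main__":
    for cmd in build_commands("vulkan", "/path/to/ggml"):
        print(shlex.join(cmd))
```

The argument lists can be passed directly to subprocess.run, which avoids shell quoting problems when the ggml path contains spaces.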
python floatllm_router.py --hardware auto --model-path /path/to/your/model.gguf --prompt "What is the capital of France?"
FloatLLM was fundamentally driven by human architectural design, but AI tools were actively leveraged as collaborative research and debugging assistants. I acted as the core systems architect, directing the routing logic, memory management, and broad structural shifts.
During development, Google Search AI Overviews were utilized for researching core concepts and discovering cross-platform C++ libraries. Gemini was heavily utilized as a debugging partner to help troubleshoot the project’s most difficult technical hurdles. Specifically, Gemini assisted in debugging the bare-metal C++ inference engine crashes, optimizing tensor management within the Python loader, and resolving complex OS-specific ggml bugs. All final implementations and architectural decisions were independently executed and tested.