FloatLLM

A bare-metal, hardware-agnostic Large Language Model (LLM) inference engine designed to run massive models on heavily memory-constrained edge devices, and to act as a safety feature for running LLMs locally.

FloatLLM is built for a fundamental shift in local AI execution: Dynamic Zero-Copy Memory Chunking.

🚀 The Architectural Shift

Traditionally, running models larger than host RAM has relied on static, layer-by-layer disk swapping, which creates massive I/O bottlenecks.

FloatLLM abandons static swapping. Instead, it queries the operating system for current memory availability and uses those figures to compute chunk boundaries, slicing standard .gguf neural network weights into execution blocks sized to fit the memory that is actually free. By leveraging native mmap (memory mapping), it avoids copying tensor data through intermediate buffers: the OS pages weights from SSD into RAM on demand, so the resident working set stays within the computed budget instead of running into an Out-of-Memory (OOM) failure.
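The chunking idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FloatLLM's loader: the names `plan_chunks` and `iter_chunks` are hypothetical, and the memory budget is passed in by the caller rather than measured.

```python
import mmap
import os


def plan_chunks(total_bytes, budget_bytes, align=mmap.ALLOCATIONGRANULARITY):
    """Split [0, total_bytes) into (offset, length) pairs within the budget.

    The chunk size is rounded down to a multiple of `align` (never below one
    unit of `align`) so that every offset is valid for mmap's `offset` argument.
    """
    chunk = max(align, (budget_bytes // align) * align)
    return [(off, min(chunk, total_bytes - off))
            for off in range(0, total_bytes, chunk)]


def iter_chunks(path, budget_bytes):
    """Yield one read-only memoryview per chunk, mapping only that chunk.

    Each view is released when the consumer asks for the next chunk, so a
    view must not be used after iteration has advanced.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        for off, length in plan_chunks(size, budget_bytes):
            # Map just this window of the file; no bytes are copied here.
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=off) as mm:
                with memoryview(mm) as view:
                    yield view
```

Mapping one window at a time keeps the mapped address range bounded by the budget; the OS still decides when pages are actually read from disk.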

This allows massive architectures to execute natively on anything from an Apple Silicon Mac to a non-rooted Android device running terminal environments, completely offline.


🏗️ Project Architecture & Status

FloatLLM is being developed in these stages:

✅ Phase 1 (Hardware Router) - floatllm_router.py

The master entry point. The router dynamically interrogates the host machine’s hardware, evaluating total RAM, free RAM, and SSD capacity.

✅ Phase 2 (Memory Loader) - floatllm_loader.py

The physical memory mapper.

✅ Phase 3 (Inference Engine) - floatllm_compute.cpp

The bare-metal execution layer utilizing ggml.

✅ Phase 4 (Tokenizer) - floatllm_tokenizer.py

The translation layer.

Phase 5 (Generation loop) - Active (Raw logit streaming test)

The output interface.
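For orientation, a raw-logit generation loop can be as small as the sketch below. It is an assumption-laden illustration, not FloatLLM's Phase 5 code: `next_logits` is a hypothetical callable standing in for the inference engine, and decoding is plain greedy argmax with no sampling.

```python
def stream_greedy(next_logits, prompt_ids, eos_id, max_new_tokens=32):
    """Yield token ids one at a time using greedy (argmax) decoding.

    `next_logits(ids)` is a hypothetical callback that returns one logit per
    vocabulary entry for the token following `ids`.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # Index of the largest logit is the chosen token id.
        token = max(range(len(logits)), key=logits.__getitem__)
        ids.append(token)
        yield token
        if token == eos_id:
            break
```

Because the function is a generator, the caller can decode and print each token as soon as it is produced rather than waiting for the full sequence.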


🛠️ Usage (Building from Source)

1. Environment & Requirements

Clone this repository and install the minimal required Python libraries:

git clone https://github.com/suryanshRoy/FloatLLM.git
cd FloatLLM
pip install -r requirements.txt

2. Fetch the GGML Library

FloatLLM relies on the ggml C library for the matrix operations. You must clone it into the project root before compiling:

git clone https://github.com/ggerganov/ggml.git

3. Download a Test Model

FloatLLM requires a model in the .gguf format. If you don’t have one, you can download a 7B-parameter test model (~3.5GB):

Using wget:

wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"

Using curl:

curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q3_K_M.gguf"

Stress-test Model:

To demonstrate FloatLLM’s core innovation, dynamic zero-copy memory chunking, you need a model large enough to exceed the RAM typically available. Run one of the commands below to download a 14-billion-parameter test model (~9GB). Both commands write to test_model.gguf, so move or rename the 7B file first if you already downloaded it.

Using wget:

wget -c -O test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"

Using curl:

curl -L -o test_model.gguf "https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf"
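Once a download finishes, a quick header check can confirm the file really is GGUF before handing it to the engine. The sketch below is an illustration rather than part of FloatLLM; `read_gguf_header` is a hypothetical name, and the layout assumed is the GGUF v2+ header (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count, all little-endian).

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic, version, tensor_count, metadata_kv_count (little-endian, no padding)
_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Return the fixed-size GGUF header fields, or raise ValueError."""
    with open(path, "rb") as f:
        raw = f.read(_HEADER.size)
    if len(raw) < _HEADER.size:
        raise ValueError("file is too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = _HEADER.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic {magic!r}; not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, `read_gguf_header("test_model.gguf")` raises `ValueError` if the download was interrupted before the header was written or if an HTML error page was saved in place of the model.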

4. Build the Compute Bridge

Make sure CMake is installed, then configure and build with the options that match your backend:

For Apple Silicon (Metal/MPS):

cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For NVIDIA GPU (CUDA):

cmake -B build -DGGML_CUDA=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For Vulkan GPU:

cmake -B build -DGGML_VULKAN=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For OpenCL:

cmake -B build -DGGML_OPENCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For SYCL (Intel OneAPI):

cmake -B build -DGGML_SYCL=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For Kompute (Vulkan-based):

cmake -B build -DGGML_KOMPUTE=ON -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4

For CPU-Only / Native ARM:

cmake -B build -DGGML_DIR=/path/to/ggml
cmake --build build --config Release -j 4
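If you script the build, the per-backend differences above reduce to a single extra configure flag. The helper below is a hypothetical convenience sketch, not something shipped with FloatLLM; the backend keys are names chosen for this example, while the flags are copied from the commands listed above.

```python
# Extra configure flag per backend, taken from the commands in this section.
BACKEND_FLAGS = {
    "metal": [],
    "cuda": ["-DGGML_CUDA=ON"],
    "vulkan": ["-DGGML_VULKAN=ON"],
    "opencl": ["-DGGML_OPENCL=ON"],
    "sycl": ["-DGGML_SYCL=ON"],
    "kompute": ["-DGGML_KOMPUTE=ON"],
    "cpu": [],
}


def cmake_commands(backend, ggml_dir, jobs=4):
    """Return (configure, build) argument lists for the chosen backend."""
    flags = BACKEND_FLAGS[backend]  # raises KeyError for an unknown backend
    configure = ["cmake", "-B", "build", *flags, f"-DGGML_DIR={ggml_dir}"]
    build = ["cmake", "--build", "build", "--config", "Release", "-j", str(jobs)]
    return configure, build
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the project root.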

5. Run the Engine

🤖 AI Acknowledgement

FloatLLM was fundamentally driven by human architectural design, but AI tools were actively leveraged as collaborative research and debugging assistants. I acted as the core systems architect, directing the routing logic, memory management, and broad structural shifts.

During development, Google Search AI Overviews were utilized for researching core concepts and discovering cross-platform C++ libraries. Gemini was heavily utilized as a debugging partner to help troubleshoot the project’s most difficult technical hurdles. Specifically, Gemini assisted in debugging the bare-metal C++ inference engine crashes, optimizing tensor management within the Python loader, and resolving complex OS-specific ggml bugs. All final implementations and architectural decisions were independently executed and tested.