    Using Local LLMs to Discover High-Performance Algorithms

By Awais · January 19, 2026 · 10 Mins Read

Ever since I was a child, I’ve been fascinated by drawing. What struck me was not only the act of drawing itself, but also the idea that every drawing could always be improved further. I remember reaching a very high level with my drawing style. Yet even when a drawing seemed as good as it could get, I would try to push it even further – alas, often with disastrous results.

Since then I have kept the same mantra in mind: “refine and iterate and you’ll reach perfection”. At university, my approach was to read books many times, expanding my knowledge by searching for other sources and hunting for hidden layers of meaning in each concept. Today, I apply the same philosophy to AI/ML and coding.

We know that matrix multiplication (matmul, for brevity) is the core operation of any AI workload. A while ago I developed LLM.rust, a Rust mirror of Karpathy’s LLM.c. The hardest part of the Rust implementation was the matrix multiplication. Since fine-tuning a GPT-based model requires thousands of iterations, we need an efficient matmul operation. For this purpose, I had to call into the BLAS library, using an unsafe strategy to overcome its limits and barriers. Using unsafe goes against Rust’s philosophy, which is why I am always looking for safer ways to improve matmul in this context.

    So, taking inspiration from Sam Altman’s statement – “ask GPT how to create value” – I decided to ask local LLMs to generate, benchmark, and iterate on their own algorithms to create a better, native Rust matmul implementation.

    The challenge has some constraints:

    • We need to use our local environment – in my case, a MacBook Pro M3 with 36GB of RAM;
    • We need to work within the model’s token limits;
    • We need to time and benchmark the code within the generation loop itself.

I know that achieving BLAS-level performance with this method is almost impossible, but I want to highlight how we can leverage AI for custom needs, even with our “tiny” laptops, so that we can unblock ideas and push boundaries in any field. This post is meant as an inspiration for practitioners, and for anyone who wants to get more familiar with Microsoft AutoGen and local LLM deployment.

The full code can be found in this Github repo. This is an ongoing experiment, and many changes and improvements will be committed.

    General idea

The overall idea is to have a roundtable of agents. The starting point is a local Mixtral 8x7B model (MrAderMacher’s Q4_K_M quantisation). From this model we create five entities:

    • the Proposer comes up with a new Strassen-like algorithm, seeking a better, more efficient way to perform matmul;
    • the Verifier reviews the matmul formulation through symbolic math;
    • the Coder writes the underlying Rust code;
    • the Tester executes it and saves all the info to the vector database;
    • the Manager acts silently, controlling the overall workflow.
    Agent     | Role
    ----------|------------------------------------------------------------------------------
    Proposer  | Analyses benchmark times and proposes new tuning parameters and matmul formulations.
    Verifier  | (Currently disabled in the code.) Verifies the Proposer’s mathematical formulation through symbolic verification.
    Coder     | Takes the parameters and works out the Rust template code.
    Tester    | Runs the Rust code, saves it and computes the benchmark timing.
    Manager   | Overall control of the workflow.
    Tab. 1: Roles of the agents.

    The overall workflow can be orchestrated through Microsoft Autogen as depicted in fig.1.

Fig.1: Matmul optimisation. The user issues an initial request with a prompt. From there the manager orchestrates the overall workflow: 1) the Proposer acts as a theorist and generates a Strassen-like algorithm; 2) the Verifier checks the mathematical correctness of the formulation; 3) the Coder generates Rust NEON code; 4) the Tester runs the benchmark. [Image generated with Nano Banana Pro.]

    Prepare the input data and vector database

The input data is collected from academic papers focused on matrix multiplication optimisation. Many of these papers are referenced in, or related to, DeepMind’s Strassen paper. I wanted to start simply, so I collected 50 papers, published between 2020 and 2025, that specifically address matrix multiplication.

Next, I used Chroma to create the vector database. The critical aspect in generating a new vector database is how the PDFs are chunked. In this context, I used a semantic chunker. Unlike plain text-splitting methods, the semantic chunker uses the actual meaning of the text to determine where to cut. The goal is to keep related sentences together in one chunk, making the final vector database more coherent and accurate. This is done using the local model BAAI/bge-base-en-v1.5. The Github gist below shows the full implementation.
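The full chunker relies on Chroma and BAAI/bge-base-en-v1.5; as a rough, self-contained sketch of the cut-where-similarity-drops idea, here is a toy version in which a bag-of-words vector stands in for the real embedding model (the `embed`, `cosine`, and `semantic_chunks` names and the 0.2 threshold are illustrative assumptions, not the gist’s code):

```python
import math
import re

def embed(sentence):
    """Toy bag-of-words vector; a stand-in for BAAI/bge-base-en-v1.5 embeddings."""
    vec = {}
    for word in re.findall(r"[a-z']+", sentence.lower()):
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text, threshold=0.2):
    """Cut between consecutive sentences whose similarity drops below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks

text = ("Strassen multiplies matrices with seven products. "
        "The seven products reduce matrix multiplication cost. "
        "Radishes grow best in cool spring weather.")
print(semantic_chunks(text))  # two chunks: the matmul sentences, then the radish one
```

In the real pipeline the resulting chunks are embedded once more and written to the Chroma collection, so a retrieval query pulls back whole coherent passages instead of arbitrary text windows.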

    The core code: autogen-core and GGML models

I have used Microsoft AutoGen, in particular the autogen-core variant (version 0.7.5). Unlike the higher-level chat API, autogen-core gives us access to the low-level, event-driven building blocks necessary to create the state-machine-driven workflow we need. The challenge is to maintain a strict workflow: all the agents must act in a specific order: Proposer –> Verifier –> Coder –> Tester.

The core part is the BaseMatMulAgent, which inherits from AutoGen’s RoutedAgent. This base class lets us standardise how LLM agents take part in the chat and how they behave.

From the code above, we can see the class is designed to participate in an asynchronous group chat, handling conversation history and calls to external tools, and generating responses through the local LLM.

The core component is @message_handler, a decorator that registers a method as a listener, or subscriber, based on the message type. The decorator automatically detects the type hint of the method’s first argument – in our case message: GroupChatMessage. It then subscribes the agent to receive any events of that type sent to the agent’s topic. The handle_message async method is then responsible for updating the agent’s internal memory, without generating a response.

With the listener-subscriber mechanism in place, we can focus on the Manager class. The MatMulManager also inherits from RoutedAgent and orchestrates the overall flow of agents.

The code above handles all the agents. We are skipping the Verifier part for the moment. The Coder publishes the final code, and the Tester takes care of saving both the code and the whole context to the vector database. In this way, we avoid consuming all the tokens of our local model: at each new run, the model catches up on the latest generated algorithms from the vector database and proposes a new solution.

One important caveat: to make sure autogen-core can work with llama models on macOS (with Metal acceleration), reinstall llama-cpp-python using the following snippet:

    #!/bin/bash 
    
    CMAKE_ARGS="-DGGML_METAL=on" FORCE_CMAKE=1 pip install --upgrade --verbose --force-reinstall llama-cpp-python --no-cache-dir

Fig.2 summarises the entire code. We can roughly subdivide it into 3 main blocks:

    • the BaseAgent, which handles messages through the LLM agents, evaluating the mathematical formulation and generating code;
    • the MatMulManager, which orchestrates the entire flow of agents;
    • autogen_core.SingleThreadedAgentRuntime, which runs the entire workflow.
Fig.2: Overall workflow in a nutshell. The base agent executes the LLM through agents, evaluates the mathematical formulation, creates the algorithm in Rust, and saves all the info to the vector database. The MatMulManager is the real core of the overall workflow. Finally, autogen_core.SingleThreadedAgentRuntime makes all of this work on our MacBook Pro. [Image created with Nano Banana Pro.]

    Results and benchmark

All the Rust code has been revised and re-run manually. While the workflow is robust, working with LLMs requires a critical eye: several times the model confabulated*, generating code that looked optimised but failed to perform the actual matmul work.

The very first iteration generates a sort of Strassen-like algorithm (the “Run 0” code in fig.3):

The model then works towards better, more NEON-oriented Rust implementations, so that after 4 iterations it produces the following code (“Run 3” in fig.3):

We can see the use of functions like vaddq_f32, a CPU intrinsic specific to ARM processors, coming from std::arch::aarch64. The model manages to use rayon to split the work across multiple CPU cores, and inside the parallel threads it uses NEON intrinsics. The code itself is not totally correct; moreover, I noticed an out-of-memory error when dealing with 1024×1024 matrices. I had to rework the code manually to make it run.
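NEON intrinsics and rayon are Rust-specific, but the cache-blocking idea behind the faster runs is language-agnostic. As an illustrative sketch (not the generated Rust kernel), here is a plain-Python tiled matmul checked against the naive triple loop – the tiling pattern is what the generated kernel applies, there combined with SIMD lanes and worker threads:

```python
def matmul_naive(a, b, n):
    """Reference triple loop: C = A * B for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

def matmul_blocked(a, b, n, tile=4):
    """Iterate over tile x tile blocks so each block stays cache-resident."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one pair of tiles into the output tile.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c

n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j % 5) for j in range(n)] for i in range(n)]
assert matmul_blocked(a, b, n) == matmul_naive(a, b, n)
```

Python itself gains nothing from blocking (the interpreter overhead dominates), but in compiled Rust the same loop order keeps the working set inside L1/L2 cache, which is where the run-1-to-run-3 improvement comes from.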

This brings us back to my mantra, “iterating to perfection”, and we can ask ourselves: can a local agent autonomously refine Rust code to the point of mastering complex NEON intrinsics? The findings suggest that yes, even on consumer hardware – with some manual supervision – this level of optimisation is achievable.

Fig.3 shows the results I obtained at each iteration.

    Fig.3: Logarithmic plot of the Rust-Neon implementation at various iterations. The calculations have been performed on 1024×1024 Matrix Multiplication benchmarks. [Image generated by the author].

The 0th and 2nd benchmarks contain errors, as it is physically impossible to achieve such results on a 1024×1024 matmul on a CPU:

    • the first code suffers from a “diagonal fallacy”: it computes only the diagonal blocks of the matrix and ignores the rest;
    • the second code has a broken buffer: it repeatedly overwrites a small, cache-hot buffer of 1028 floats, rather than traversing the full ~1 million elements.
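Both failure modes are easy to catch automatically before trusting a timing: compare the candidate against a trusted reference on random inputs. Here is a sketch of such a check (all function names are hypothetical, not from the repo), with the diagonal fallacy reproduced as the buggy candidate:

```python
import random

def matmul_ref(a, b, n):
    """Trusted reference implementation."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_diagonal_fallacy(a, b, n):
    """Buggy candidate: computes only diagonal entries, like run 0."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        c[i][i] = sum(a[i][k] * b[k][i] for k in range(n))
    return c

def verify(candidate, n=16, trials=3, tol=1e-9):
    """Compare a candidate matmul against the reference on random inputs."""
    for _ in range(trials):
        a = [[random.random() for _ in range(n)] for _ in range(n)]
        b = [[random.random() for _ in range(n)] for _ in range(n)]
        want, got = matmul_ref(a, b, n), candidate(a, b, n)
        if any(abs(want[i][j] - got[i][j]) > tol
               for i in range(n) for j in range(n)):
            return False
    return True

print(verify(matmul_ref))                # True
print(verify(matmul_diagonal_fallacy))   # False
```

Wiring a check like this into the Tester would let the loop discard “impossibly fast” runs on its own, instead of relying on a human to spot that 1M multiply-adds cannot finish in microseconds.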

However, the workflow produced two real implementations: run 1 and run 3. Run 1 achieves 760 ms and constitutes a real baseline; it suffers from cache misses and the lack of SIMD vectorisation. Run 3 records 359 ms; the improvement comes from NEON SIMD and Rayon parallelism.

*: I wrote “the model confabulates” on purpose. From a medical point of view, LLMs do not hallucinate – they confabulate. Hallucination is a totally different phenomenon from what LLMs do when they babble and generate “wrong” answers.

    Conclusions

This experiment started with a question that seemed an impossible challenge: “can we use consumer-grade local LLMs to discover high-performance Rust algorithms that can compete with BLAS implementations?”.

We can say yes – or at least that we now have a valid, solid foundation on which to build better code towards a full BLAS-like implementation in Rust.

The post showed how to work with Microsoft AutoGen and autogen-core, and how to create a roundtable of agents.

The base model comes in GGUF format, and it can run on a MacBook Pro M3 with 36GB of RAM.

Of course, we haven’t (yet) found anything better than BLAS in a single, simple piece of code. However, we showed that a local agentic workflow, on a MacBook Pro, can achieve what was previously thought to require a massive cluster and massive models. Eventually, the model managed to find a reasonable Rust-NEON implementation (“Run 3” above) that achieves a speed-up of over 50% over a standard Rayon implementation. We must highlight that the backbone implementation was AI-generated.

The frontier is open. I hope this blog post inspires you to explore what limits we can overcome with local LLM deployment.


    I am writing this in a personal capacity; these views are my own.
