Bake an LLM with custom prompts into your app? Sure! Here's how to get started

Hands on Large language models (LLMs) are generally associated with chatbots such as ChatGPT, Copilot, and Gemini, but they're by no means limited to Q&A-style interactions. Increasingly, LLMs are being integrated into everything from IDEs to office productivity suites.

Besides content generation, these models can be used to, for example, gauge the sentiment of writing, identify topics in documents, or clean up data sources, given the right training, prompts, and guardrails. As it turns out, baking LLMs into your application code to add some language-based analysis isn't all that difficult, thanks to highly extensible inferencing engines such as Llama.cpp or vLLM. These engines take care of loading and parsing a model and performing inference with it.

In this hands on, aimed at intermediate-level-or-higher developers, we'll be taking a look at a relatively new LLM engine written in Rust called mistral.rs.

This open source project boasts support for a growing number of popular models, and not just those from Mistral, the startup that seemingly inspired the project's name. Plus, mistral.rs can be integrated into your projects using Python, Rust, or OpenAI-compatible APIs, making it relatively easy to slot into new or existing code.

But, before we jump into how to get up and running, or the various ways it can be used to build generative AI models into your code, we need to discuss hardware and software requirements.

Hardware and software support

With the right flags, mistral.rs works with Nvidia CUDA or Apple Metal, or it can run directly on your CPU, although performance will be much slower in CPU mode. At the time of writing, the platform doesn't support AMD or Intel GPUs just yet.

In this guide, we're going to be looking at deploying mistral.rs on an Ubuntu 22.04 system. The engine does support macOS, but, for the sake of simplicity, we're going to be sticking with Linux for this one.

We recommend a GPU with a minimum of 8GB of vRAM, or at least 16GB of system memory if running on your CPU - your mileage may vary depending on the model.

Nvidia users will also want to make sure they've got the latest proprietary drivers and CUDA binaries installed before proceeding. You can find more information on setting that up here.

Grabbing our dependencies

Installing mistral.rs is fairly straightforward, and the process varies slightly depending on your specific use case. Before getting started, let's get the dependencies out of the way.

According to the README, the only packages we need are libssl-dev and pkg-config. However, we found a few extra packages were necessary to complete the installation. Assuming you're running Ubuntu 22.04 like we are, you can install them by executing:
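The README only names libssl-dev and pkg-config; the extra packages below - build-essential, curl, and python3-pip - are our best guess at what a fresh Ubuntu 22.04 install will also need, so adjust to suit your system:

```shell
sudo apt update
sudo apt install -y build-essential curl libssl-dev pkg-config python3-pip
```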

Once those are out of the way, we can install and activate Rust by running the Rustup script.
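The standard Rustup one-liner looks like this:

```shell
# Download and run the Rustup installer script
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Bring cargo and rustc into the current shell session
. "$HOME/.cargo/env"
```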

Yes, this involves downloading and executing a script right away; if you prefer to inspect the script before it runs, the code for it is here.

By default, mistral.rs uses Hugging Face to fetch models on our behalf. Because many of these files require you to be logged in before you can download them, we'll need to install the huggingface_hub tooling and authenticate by running:
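Something along these lines should do it; the huggingface-cli login step is what triggers the token prompt:

```shell
pip install huggingface_hub
huggingface-cli login
```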

You'll be prompted to enter your Hugging Face access token, which you can create from the access tokens page in your Hugging Face account settings.


With our dependencies installed, we can move on to deploying mistral.rs itself. To start, we'll use git to pull down the latest release of mistral.rs from GitHub and navigate into our working directory:
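The project lives under EricLBuehler/mistral.rs on GitHub:

```shell
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
```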

Here's where things get a little tricky, depending on how your system is configured or what kind of accelerator you're using. In this case, we'll be looking at CPU (slow) and CUDA (fast) inferencing in mistral.rs.

For CPU-based inferencing, we can simply execute:
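A plain release build should suffice here:

```shell
cargo build --release
```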

Meanwhile, those with Nvidia-based systems will want to run:
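The same build, but with the project's CUDA feature flag enabled:

```shell
cargo build --release --features cuda
```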

This bit could take a few minutes to complete, so you may want to grab a cup of tea or coffee while you wait. After the executable has finished compiling, we can copy it to our working directory:
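Assuming the binary is named mistralrs-server, as it is in the repo at the time of writing:

```shell
cp ./target/release/mistralrs-server ./
```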

Testing out mistral.rs

With mistral.rs installed, we can check that it actually works by running a test model, such as Mistral-7b-Instruct, in interactive mode. Assuming you've got a GPU with around 20GB or more of vRAM, you can just run:
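The command-line syntax has shifted between releases, so treat this as a sketch: -i requests interactive mode, the plain subcommand loads an unquantized Hugging Face model by ID, and -a names the architecture (newer builds may infer it automatically):

```shell
./mistralrs-server -i plain -m mistralai/Mistral-7B-Instruct-v0.2 -a mistral
```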

However, the odds are your GPU doesn't have the memory necessary to run the model at the 16-bit precision it was designed around. At this precision, you need 2GB of memory for every billion parameters, plus additional capacity for the key value cache. And even if you have enough system memory to deploy it on your CPU, you can expect performance to be quite poor as your memory bandwidth will quickly become a bottleneck.

Instead, we want to use quantization to shrink the model to a more reasonable size. In mistral.rs, there are two ways to go about this. The first is to use in-situ quantization (ISQ), which downloads the full-sized model and then quantizes it down to the desired size. In this case, we'll be quantizing the model from 16 bits down to four bits. We can do this by adding --isq Q4_0 to the previous command like so:
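Based on the CLI layout at the time of writing, --isq sits before the model subcommand; flag placement may differ on other versions:

```shell
./mistralrs-server -i --isq Q4_0 plain -m mistralai/Mistral-7B-Instruct-v0.2 -a mistral
```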

Note: If mistral.rs crashes before finishing, you probably don't have enough system memory and may need to add a swapfile - we added a 24GB one - to complete the process. You can temporarily add and enable a swapfile - just remember to delete it after you reboot - by running:
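The usual recipe for a temporary 24GB swapfile on Ubuntu looks like this:

```shell
sudo fallocate -l 24G /swapfile   # reserve 24GB on disk
sudo chmod 600 /swapfile          # restrict access to root
sudo mkswap /swapfile             # format it as swap space
sudo swapon /swapfile             # enable it until the next reboot
```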

Once the model has been quantized, you should be greeted with a chat-style interface where you can start querying the model. You should also notice that the model is using considerably less memory - around 5.9GB in our testing - and performance should be much better.

However, if you'd prefer not to quantize the model on the fly, mistral.rs also supports pre-quantized GGUF and GGML files, for example these ones from Tom "TheBloke" Jobbins on Hugging Face.

The process is fairly similar, but this time we'll need to specify that we're running a GGUF model and set the ID and filename of the LLM we want. In this case, we'll download TheBloke's 4-bit quantized version of Mistral-7B-Instruct.
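Per the README's GGUF examples - though again, flag names have changed between releases - the model repo ID goes to -m and the quantized file to -f:

```shell
./mistralrs-server -i gguf -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
```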

Putting the LLM to work

Running an interactive chatbot in a terminal is cool and all, but it isn't all that useful for building AI-enabled apps. Instead, mistral.rs can be integrated into your code using its Rust or Python APIs, or via an OpenAI-compatible HTTP server.

To start, we'll look at tying into mistral.rs's HTTP server, since it's arguably the easiest to use. In this example, we'll be using the same 4-bit quantized Mistral-7B model as in our last example. Note that instead of starting the server in interactive mode, we've replaced the -i flag with -p and provided the port we want the server to be accessible on.
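For instance, with 8342 as an arbitrary port number of our choosing:

```shell
./mistralrs-server -p 8342 gguf -m TheBloke/Mistral-7B-Instruct-v0.2-GGUF -f mistral-7b-instruct-v0.2.Q4_K_M.gguf
```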

Once the server is up and running, we can access it programmatically in a couple of different ways. The first would be to use curl to pass the instructions we want to give to the model. Here, we're posing the question: "In machine learning, what is a transformer?"
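Since the server exposes an OpenAI-compatible endpoint, a standard chat-completions request should work; the "model" string is a placeholder, as a single-model server generally ignores it:

```shell
curl http://localhost:8342/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral",
        "messages": [
          {"role": "user", "content": "In machine learning, what is a transformer?"}
        ]
      }'
```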

After a few seconds, the model should spit out a neat block of text formatted in JSON.

We can also interact with the server using the OpenAI Python library, though you'll probably need to install it using pip first:
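```shell
pip install openai
```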

You can then call the server using a template, such as this one written for completion tasks.
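A minimal sketch using the modern OpenAI client, pointed at the local server; the base_url, port, and model name all follow on from the assumptions in the earlier server command:

```python
from openai import OpenAI

# Point the client at the local mistral.rs server. The API key is unused,
# but the library requires one, so any placeholder string will do.
client = OpenAI(base_url="http://localhost:8342/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistral",  # largely cosmetic when the server hosts a single model
    messages=[
        {"role": "user", "content": "In machine learning, what is a transformer?"},
    ],
)

# Print just the assistant's reply text
print(completion.choices[0].message.content)
```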

You can find more examples showing how to work with the HTTP server over in the GitHub repo here.

Embedding mistral.rs deeper into your projects

While convenient, the HTTP server isn't the only way to integrate mistral.rs into your projects. You can achieve similar results using its Rust or Python APIs.

Here's a basic example from the repo showing how to use the project as a Rust crate - what the Rust world calls a library - to pass a query to Mistral-7B-Instruct and generate a response. Note: We found we had to make a few tweaks to the original example code to get it to run.
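A sketch along the lines of the mistralrs crate's high-level GGUF builder API; exact type and method names vary between releases, so check the repo's examples against your version:

```rust
use anyhow::Result;
use mistralrs::{GgufModelBuilder, TextMessageRole, TextMessages};

#[tokio::main]
async fn main() -> Result<()> {
    // Load the quantized GGUF file from the current directory, along with
    // the chat template copied from the mistral.rs repo.
    let model = GgufModelBuilder::new(
        ".",
        vec!["mistral-7b-instruct-v0.2.Q4_K_M.gguf"],
    )
    .with_chat_template("mistral.json")
    .build()
    .await?;

    // Build a single-turn chat request and send it to the model.
    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "In machine learning, what is a transformer?",
    );
    let response = model.send_chat_request(messages).await?;

    // Print the assistant's reply.
    println!(
        "{}",
        response.choices[0]
            .message
            .content
            .as_deref()
            .unwrap_or_default()
    );
    Ok(())
}
```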

If you want to test this out for yourself, start by stepping up out of the current directory, creating a folder for a new Rust project, and entering that directory. We could use cargo new to create the project, which is recommended, but this time we'll do it by hand so you can see the steps.
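The manual equivalent of cargo new, with "demo" as an arbitrary project name; the src folder is where Cargo expects the source code to live:

```shell
cd ..
mkdir -p demo/src
cd demo
```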

Once there, you'll want to copy the mistral.json template from ../ and download the mistral-7b-instruct-v0.2.Q4_K_M.gguf model file from Hugging Face.
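These paths assume the mistral.rs checkout sits alongside the new project, with the template in the repo's chat_templates directory; adjust as needed:

```shell
cp ../mistral.rs/chat_templates/mistral.json ./
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```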

Next, we'll create a Cargo.toml file with the dependencies we need to build the app. This file tells the Rust toolchain details about your project. Inside this .toml file, paste the following:
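Something like the following, with the package name being an arbitrary choice and the dependency versions a reasonable baseline:

```toml
[package]
name = "demo"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1"
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["cuda"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```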

Note: You'll want to remove the , features = ["cuda"] part if you aren't using GPU acceleration.

Finally, paste the contents of the demo app above into a file called main.rs inside the src folder, as Cargo expects.

With these four files - main.rs, Cargo.toml, mistral-7b-instruct-v0.2.Q4_K_M.gguf, and mistral.json - in place, we can test whether it works by running:
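```shell
cargo run --release
```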

After about a minute, you should see the answer to our query appear on screen.

Obviously, this is an incredibly rudimentary example, but it illustrates how mistral.rs can be used to integrate LLMs into your Rust apps by incorporating the crate and using its library interface.

If you're interested in using mistral.rs in your Python or Rust projects, we highly recommend checking out its documentation for more information and examples.

We hope to bring you more stories on utilizing LLMs soon, so be sure to let us know what we should explore next in the comments. ®

Editor's Note: Nvidia provided The Register with an RTX A6000 Ada Generation graphics card to support this story and others like it. Nvidia had no input as to the contents of this article.
