Hands on Meta has been influential in driving the development of open language models with its Llama family, but up until now, the only way to interact with them has been through text.
With the launch of its multimodal models late last month, the Facebook parent has given Llama 3 sight.
According to Meta, these models can now use a combination of images and text prompts to "deeply understand and reason on the combination." For example, the vision models could be used to generate appropriate keywords based on the contents of an image, chart, or graphic, or to extract information from a PowerPoint slide.
You can ask this openly available model - which can be run locally, not just in the cloud - not only what's in a picture, but also follow-up questions or requests about that content.
That said, in our testing, we found that, just like the scarecrow in the Wizard of Oz, what this model could really use is a brain.
Meta's vision models are available in both 11 and 90 billion parameter variants, and have been trained on a large corpus of image and text pairs.
These models are by no means the first vision-equipped LLMs we've come across. Microsoft has Phi 3.5 Vision, Mistral has Pixtral, and before that there was Llava, to name a few. However, Meta's latest neural networks are notable because this marks the first time the Facebook owner - a corporation that has arguably set the tone for open LLM development - has ventured into the multimodal space.
To find out whether Meta's latest Llama had 20/20 vision or could use a pair of spectacles, we spun up the 11 billion parameter version of the model in a vLLM model server.
We ran the model through a number of scenarios ranging from optical character recognition (OCR) and handwriting recognition to image classification and data analysis to get a sense of where its strengths and weaknesses lie - and pretty much immediately ran into its limitations.
Our first test involved uploading a line chart from the US Bureau of Labor Statistics and asking the model to analyze the graph.
Initially, the results look promising, but it quickly becomes clear that most of the model's conclusions - and figures - don't match what the graph actually depicts. For example, the model asserts that the working-poor rate peaks in 1993 and declines afterward, which isn't true. It then contradicts itself by citing a higher rate in 2010, which is also incorrect by several points. From there, the errors just keep piling up.
To ensure this wasn't a fluke, we asked the model to analyze a bar chart showing the change in the consumer price index for food, energy, and all items excluding food and energy. It seems like a simple task, given how straightforward the chart is, right? Well, not for this Llama.
As you can see, once again, it's obvious that the model was able to extract some useful information, but very quickly started making things up.
We ran a few more tests with different chart types including a scatter plot (not pictured) and came up with similar results that, while clearly influenced by the contents of the image, were riddled with errors.
It's possible that this is due to how we configured the model in vLLM and it's blowing through its context - essentially its short-term memory - and getting lost. However, we found that when this was the case, the model usually just stalled out on us rather than issuing a nonsensical response. vLLM's logs also didn't show anything to suggest we were running out of KV cache.
Instead, we suspect the real issue here is that interpreting a chart, even a simple one, requires a level of reasoning and understanding that an 11-billion-parameter neural network just isn't capable of simulating yet. Perhaps the 90-billion-parameter version of the model performs better here, but unless you've got a server full of H100s or H200s at your disposal, good luck running that at home.
So, while Meta may have given Llama eyes, what it really needs is a brain. But since vision is apparently a much easier problem to solve than artificial general intelligence, we guess we can forgive Meta for having to start somewhere.
While analyzing images and drawing conclusions from them may not be Llama 3.2 11B's strong point, that's not to say it didn't perform well in other scenarios.
Image recognition and classification
Presented with an image of Boston Dynamics' robodog Spot, snapped on the Siggraph show floor earlier this summer, the model was not only able to identify the bot by name but offered a fairly comprehensive description of the image.
Yes, computer vision has been able to do object and scene classification for a long while, but as you can see, the LLM nature of the model allows it to riff on the image a little more.
We then fed in a stock image of a mountain lake filled with boats to see how effective the model was at following directions.
Zoning in on parts of an image is another thing computer vision has long been able to do, though again, it's potentially interesting to do it with a natural language model like Meta's. In fact, with a lot of these tests, what Llama can do isn't a breakthrough, but that doesn't make it any less useful.
Sentiment analysis
Another potential use case for these kinds of models is sentiment analysis based on facial expressions or body language. One can imagine using these models to get a sense of whether stadium-goers are enjoying a half-time show or whether customers at a department store are responding favorably to a new marketing campaign.
To put this to the test, we ran a stock image from our back catalog depicting a tired IT worker through the model to see how effective it might be at evaluating the emotional state of a subject from an image.
While the model didn't describe the subject as looking tired, it still managed to provide an accurate evaluation of his likely emotional state.
OCR and handwriting recognition
We also found the model had little to no problem with stripping text from images and even hand-written notes.
The first may as well be table stakes for existing machine learning models - your iPhone will do this automatically - but the handwriting recognition was actually quite good. As you can see from the example below, the model was able to convert the note to text with only a handful of errors. That said, the note was by no means chicken scratch, so your mileage may vary.
While less reliable, we found the model was also reasonably effective at extracting the contents of tables and converting them into other formats. In this example, we took a screenshot showing the relative floating-point performance of various Nvidia Blackwell GPUs announced this spring and asked Llama 3.2 to convert it into a Markdown table.
The result isn't perfect - the model clearly struggles with the concept of empty cells - but fixing that is relatively straightforward compared to writing the table manually.
Unfortunately, this test was also the least consistent. In some cases, the model would randomly drop a column or omit critical information. So, we wouldn't exactly trust it to work perfectly every time.
If you'd like to test out Meta's vision models for yourself, they're relatively easy to get up and running, assuming you've got a beefy enough graphics card.
As we mentioned earlier, we'll be using vLLM, which is commonly used to serve LLMs in production environments. To keep things simple, we'll be running the vLLM OpenAI-compatible API server in a Docker container.
Normally, we'd stick with something like Ollama, as it's dead simple to set up. But at the time of writing, vLLM was one of the few model runners with support for the vision models. We're told Ollama plans to add support for the models in the near future. Meanwhile, just before publication, Rust-based LLM runner Mistral.rs announced support for the model.
In this guide, we'll be deploying two containers: The first being vLLM and the second being Open WebUI, which will provide us with a convenient interface for interacting with the model.
Deploy the full model
Assuming you've got an Nvidia GPU with at least 24 GB of vRAM, you can run the full Llama 3.2 11B vision model at 16-bit precision using the one-liner below. Just don't forget to request access to the model here, and swap your access token into the command where indicated.
Note: depending on your configuration, you may need to run this command with elevated privileges, eg via sudo.
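For reference, a minimal sketch of such a one-liner - assuming the official vllm/vllm-openai container image, the meta-llama/Llama-3.2-11B-Vision-Instruct weights on Hugging Face, and <YOUR_HF_TOKEN> standing in for your access token - looks something like this:

# Sketch: serve Llama 3.2 11B Vision at 16-bit precision on port 8000.
# --max-model-len, --enforce-eager, and --max-num-seqs are illustrative
# memory-saving settings for a 24 GB card; tune them for your hardware.
docker run --rm -it --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e "HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN>" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --max-model-len 8192 \
  --enforce-eager \
  --max-num-seqs 8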
Depending on how fast your internet connection is, this may take a few minutes to spin up as the model files are roughly 22 GB in size. Once complete, the API server will be exposed on port 8000.
Deploy an 8-bit quantized model
If you don't have enough vRAM to run the model at 16-bit precision, you may be able to get away with using an 8-bit quantized version of the model, like this FP8 version from Neural Magic. At this precision, the model needs a little over 11 GB of vRAM and should fit comfortably within a 16 GB card, like the RTX 4060 Ti 16GB or RTX 4080.
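The command is much the same as the full-precision one, just pointed at the quantized checkpoint - a sketch assuming Neural Magic's Llama-3.2-11B-Vision-Instruct-FP8-dynamic repository on Hugging Face:

# Sketch: serve the FP8-quantized build instead of the 16-bit weights
docker run --rm -it --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e "HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN>" \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic \
  --max-model-len 8192 \
  --enforce-eager \
  --max-num-seqs 8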
Note that quantization can have a negative impact on the quality of model outputs. You can find more information on model quantization here.
Deploy the WebUI
Once you've got the model server running, you can either start querying Llama 3.2 11B using something like curl, or use a front-end like Open WebUI, which is what we opted to do. We previously explored Open WebUI in our retrieval augmented generation (RAG) guide here, but if you just want to get up and running to test with, you can deploy it with a Docker command along the lines of the one below.
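This sketch uses Docker's host networking so the Open WebUI container can reach the vLLM server on localhost:8000, which assumes a Linux Docker host - adjust the networking to suit your own setup:

# Sketch: run Open WebUI with host networking so it can talk to the
# vLLM server on localhost:8000; the web interface listens on port 8080
docker run -d --network=host \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main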
Once it's spun up, you can access Open WebUI by visiting http://localhost:8080. If you're running Open WebUI on a different machine or server, you'll need to replace localhost with its IP address or hostname, and make sure port 8080 is open on its firewall or otherwise reachable by your browser.
From there, create a local account (this will automatically be promoted to administrator) and add a new OpenAI API connection under the Admin Panel using localhost:8000/v1 as the address and empty as the API key.
Put it to the test
If everything is configured correctly, you should be able to start a new chat, and select Meta's Llama 3.2 11B model from the drop-down menu at the top of the page.
Next, upload your image, enter a prompt, and press the Enter key to start chatting with the model.
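Alternatively, if you'd rather skip the GUI, you can hit the vLLM server's OpenAI-compatible API directly with curl, as mentioned earlier. Here's a rough sketch that base64-encodes a local image called test.jpg and sends it along with a prompt - the file name and the GNU-style base64 -w0 flag are just assumptions for illustration (macOS users will want base64 -i test.jpg instead):

# Sketch: send an image plus a text prompt to the vLLM server's
# OpenAI-compatible chat completions endpoint.
# Adjust the "model" field if you're serving the quantized build.
IMG=$(base64 -w0 test.jpg)   # encode the image (GNU coreutils syntax)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Identify the four most relevant features in this image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG"'"}}
      ]
    }],
    "max_tokens": 256
  }'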
Troubleshooting
If you run into trouble with the model stalling out or refusing to load, you may need to either reduce the size of your image or increase the context window.
The bigger and more complex the image, the more of the model's context it consumes. Llama 3.2 supports a pretty large 128,000 token context window, which is great unless you're memory constrained.
In our guide, we set the context window - --max-model-len in vLLM parlance - to 8,192 tokens in order to avoid running out of memory on a 24 GB card. But if you've got memory to spare, or you opted for a quantized model, you can increase this to something like 12,288 or 16,384. Just know that if you push this too far, you may run into out-of-memory errors. In that case, try dropping the context size in increments of 1,024 until the model loads successfully.
In our testing, we also noticed that a little prompt engineering can go a long way to getting better results. Giving a vision model a pic and telling it exactly what you want it to do with it, say, "Identify the four most relevant features in this image and assign the relevant photograph keywords," is likely going to render a more useful result than "What is this?"
While this might seem obvious, images can contain a lot of information that our brains will automatically filter out, but the model won't unless it's told to focus on a specific element. So, if you're running into problems getting the model to behave as intended, you might want to spend a little extra time writing a better prompt.
The Register aims to bring you more on using LLMs, vision-capable LLMs, and other AI technologies - without the hype - soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we'd love to hear about them in the comments section below. ®
Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia and an Arc A770 GPU by Intel to support stories like this. Neither supplier had any input as to the contents of this and other articles.