OpenAI on Tuesday announced the qualified arrival of GPT-4, its latest milestone in the making of call-and-response deep learning models and one that can seemingly outperform its fleshy creators in important exams.
According to OpenAI, the model exhibits "human-level performance on various professional and academic benchmarks." GPT-4 can pass a simulated bar exam in the top 10 percent of test takers, whereas its predecessor, GPT-3.5 (the basis of ChatGPT) scored around the bottom 10 percent.
GPT-4 also performed well on various other exams, such as SAT Math (700 out of 800). It's not universally capable, however, scoring only a 2 (14th to 44th percentile) on AP English Language and Composition.
GPT-4 is a large multimodal model, as opposed to a large language model. It is designed for accepting queries via text and image inputs, with answers returned in text. It's being made available initially via the waitlisted GPT-4 API and to ChatGPT Plus subscribers in a text-only capacity. Image-based input is still being refined.
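For developers coming off the waitlist, requests to GPT-4 use the same chat-completions message format as GPT-3.5. A minimal sketch of a text-only request body (the helper function name and prompt are ours; the `model`/`messages` fields follow OpenAI's chat format):

```python
import json

def build_gpt4_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a text-only chat-completions request body for GPT-4.

    The messages field is a list of role/content pairs, with the
    "user" role carrying the prompt -- the same shape GPT-3.5 uses.
    """
    return {
        "model": "gpt-4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_gpt4_request("Summarize this blog post in one sentence.")
print(json.dumps(body, indent=2))
```

Image inputs would eventually ride alongside text in the same message list, but as noted, that capability isn't generally available yet.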
Despite the addition of a visual input mechanism, OpenAI is not being open about the making of its model. The upstart has chosen not to release details about its size, how it was trained, or what data went into the process.
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar," the company said in its technical paper [PDF].
In a live stream on YouTube, Greg Brockman, president and co-founder of OpenAI, demonstrated the difference between GPT-4 and GPT-3.5 by asking the models to summarize the OpenAI GPT-4 blog post in a single sentence where every word begins with the letter "G."
GPT-3.5 simply didn't try. GPT-4 returned: "GPT-4 generates ground-breaking, grandiose gains, greatly galvanizing generalized AI goals." And when Brockman told the model that "AI" doesn't count as a G-word, GPT-4 revised its response with another G-laden sentence, this time without "AI" in it.
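Brockman's constraint is, of course, trivial to check mechanically, unlike satisfying it. A quick sketch of such a checker (the function and its punctuation handling are our own invention, not OpenAI's):

```python
import string

def words_not_starting_with(sentence: str, letter: str) -> list:
    """Return the words violating an 'every word begins with <letter>' rule,
    ignoring surrounding punctuation and letter case."""
    offenders = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)
        if word and not word.lower().startswith(letter.lower()):
            offenders.append(word)
    return offenders

sentence = ("GPT-4 generates ground-breaking, grandiose gains, "
            "greatly galvanizing generalized AI goals.")
print(words_not_starting_with(sentence, "G"))  # ['AI'] -- the word Brockman flagged
```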
Finally, Brockman set up GPT-4 to analyze 16 pages of US tax code to return the standard deduction for a couple, Alice and Bob, with specific financial circumstances. OpenAI's model responded with the correct answer, along with an explanation of the calculations involved.
Beyond better reasoning, evident in its improved test scores, GPT-4 is intended to be more collaborative (iterating as directed to improve previous output), better able to handle lots of text (analyzing or outputting novella-length chunks of around 25,000 words), and able to accept image-based input (for object recognition, though that capability isn't yet publicly available).
What's more, GPT-4, according to OpenAI, should be less likely to go off the rails than its predecessors.
"We've spent six months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails," the org says.
People may already be familiar with this "far from perfect" level of safety from the rocky debut of Microsoft Bing's question-answering capabilities, whose underlying Prometheus model is based on GPT-4.
OpenAI acknowledges that GPT-4 "hallucinates facts and makes reasoning errors" like its ancestors, but the org insists the model does so to a lesser extent.
"While still a real issue, GPT-4 significantly reduces hallucinations relative to previous models (which have themselves been improving with each iteration)," the company explains. "GPT-4 scores 40 percent higher than our latest GPT-3.5 on our internal adversarial factuality evaluations."
Pricing for GPT-4 is $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens, where a token is about four characters. There's also a default rate limit of 40,000 tokens per minute and 200 requests per minute.
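At those rates, estimating the bill for a call is simple arithmetic on the two token counts. A sketch (the per-1,000-token prices come from OpenAI's announcement; the helper name is ours):

```python
# GPT-4 pricing at launch: $0.03 per 1,000 prompt tokens,
# $0.06 per 1,000 completion tokens.
PROMPT_PRICE_PER_TOKEN = 0.03 / 1000
COMPLETION_PRICE_PER_TOKEN = 0.06 / 1000

def gpt4_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one GPT-4 API call at launch pricing."""
    return (prompt_tokens * PROMPT_PRICE_PER_TOKEN
            + completion_tokens * COMPLETION_PRICE_PER_TOKEN)

# e.g. a 2,000-token prompt with a 1,000-token answer:
print(f"${gpt4_cost(2000, 1000):.2f}")  # $0.12
```

With tokens averaging about four characters, that example works out to roughly 8,000 characters in and 4,000 out for 12 cents, comfortably inside the default 40,000-tokens-per-minute cap.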
Despite ongoing concern about AI risks, there's a rush to bring AI models to market. On the same day GPT-4 arrived, Anthropic, a startup formed by former OpenAI employees, introduced its own chat-based helper called Claude for handling text summarization and generation, search, Q&A, coding, and more. That's also available via a limited preview.
And Google, worried about falling behind in the marketing of AI models, rolled out the PaLM API for interacting with various large language models, along with a prototyping environment called MakerSuite.
A few weeks earlier, Facebook launched its LLaMA large language model, which Stanford researchers have since turned into the Alpaca model; The Register will be covering that in more detail later.
"There's still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model," OpenAI concludes. ®