A new study suggests that GPT-4 can do more with faces in photos than many users may expect. Researchers found that the model could recognize faces, determine gender and estimate age with accuracy that rivals specialized facial recognition algorithms, despite not being explicitly trained for biometric tasks.
The finding matters because it brings two issues together: strong biometric performance and a safety system that researchers were able to work around with a simple prompt. The study does not argue that GPT-4 should replace dedicated recognition tools. In fact, the authors warn against relying on it alone, because it can produce descriptions that sound credible while still being wrong.
What the researchers tested
The study was conducted by researchers at the Norwegian University of Science and Technology and the Mizani and Idiap Research Institute. They examined GPT-4's biometric capabilities and compared its performance with specialized facial recognition algorithms, including MobileFaceNet.
The work focused on three practical tasks: recognizing faces, determining gender and estimating age in photos. These are not casual image descriptions. They are biometric tasks, because they involve identifying or inferring sensitive traits from a person's face.
The central point is that GPT-4 was not explicitly trained to perform these tasks. Even so, the model performed at a level that the researchers found comparable to systems built specifically for facial analysis.
Gender recognition and age estimation results
In gender recognition tests, GPT-4 reached 100% accuracy on a dataset of 5,400 balanced images. That result exceeded DeepFace, a model designed for the same task, which scored 99% accuracy.
The age estimation results were also notable, though not perfect. Using the UTKFace dataset, GPT-4 correctly identified the age range 74.25% of the time. The researchers observed that the model tended to give wider age ranges for people over 60 than for younger individuals.
These results show why the study is more than a technical curiosity. A general-purpose model that can perform biometric tasks well can create value in some settings, but it also raises questions about when such analysis should be allowed, how it should be controlled and what users should be told about its limits.
- Facial recognition: GPT-4 performed on par with specialized facial recognition algorithms such as MobileFaceNet.
- Gender recognition: GPT-4 achieved 100% accuracy on a dataset of 5,400 balanced images.
- Age estimation: GPT-4 correctly identified the age range 74.25% of the time using the UTKFace dataset.
The safety issue the study exposed
The researchers also reported a potential safety problem. GPT-4 has built-in safeguards against revealing sensitive biometric information, but the researchers found a way around them.
The workaround was simple: they claimed in the prompt that an image was AI-generated. With that framing, they were able to trick the system into analyzing real photos. According to the researchers, this shows the need for more safety research into large language models, especially when those models perform strongly on biometric tasks.
This is the most important distinction in the study. The fact that large language models can have biometric capabilities is not new. The source article notes that OpenAI has pointed this out before, and that in the OpenAI app Be My Eyes for the visually impaired, person recognition was disabled at an early stage.
What is new here is the combination of high accuracy and a bypass that did not require a complex method. If a model can be prompted into treating real photos as something else, safeguards may fail at exactly the point where sensitive analysis begins.
Why the findings should be read carefully
The study does not support treating GPT-4 as a flawless recognition system. The authors warn against relying solely on GPT-4 for recognition tasks because it can give convincing but incorrect descriptions.
That warning is important. A system can be highly accurate in a test and still fail in ways that matter. When the output concerns a person's face, gender or age, an incorrect answer is not just a technical error. It can affect how a person is interpreted or treated.
The research also reinforces a broader lesson about AI safety. Safeguards need to account not only for what a model can do, but also for how users may frame a request. If a prompt changes the model's behavior around sensitive biometric information, then the safety design must handle that route directly.
For developers, researchers and product teams, the message is straightforward: large language models can show unexpected strength outside their explicit training target. When that strength touches biometric information, accuracy alone is not enough. The system also needs reliable boundaries, clear limits and careful evaluation under realistic prompts.