AHa-Bench

Benchmarking Audio Hallucinations in Large Audio-Language Models


Abstract

Hallucinations present a significant challenge in the development and evaluation of large language models (LLMs), directly affecting their reliability and accuracy. While research on textual and visual hallucinations has advanced notably, there is still no comprehensive benchmark for evaluating auditory hallucinations in large audio-language models (LALMs). To fill this gap, we introduce AHa-Bench, a systematic and comprehensive benchmark for audio hallucinations. Audio data combines the multi-attribute complexity of visual data with the semantic richness of textual data, so auditory hallucinations share characteristics with both visual and textual hallucinations. Based on the source of these hallucinations, AHa-Bench categorizes them into semantic hallucinations, acoustic hallucinations, and semantic-acoustic confusion hallucinations. In addition, we systematically evaluate seven open-source LALMs, demonstrating the challenges these models face in audio understanding, especially in jointly understanding semantic and acoustic information. By providing a comprehensive evaluation framework, AHa-Bench aims to improve the robustness and stability of LALMs and to foster more reliable and nuanced audio understanding.

AHa-Bench Cases

Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Does this audio feature human-voiced imitations of bassoon?
no [in Chinese] Yes, this audio is a human-voiced imitation of a bassoon sound. Yes No, the audio features the sound of an actual bassoon, not a human-voiced imitation. The timbre and sustained note are characteristic of the instrument itself.


Question: Does this audio feature human-voiced imitations of car alarm?
yes [in Chinese] Yes, this audio contains a human imitation of a car alarm. No Yes, the whistling sounds at the beginning (approximately 0:00-0:04) are human-voiced imitations of a car alarm. The sound that follows is a synthetic electronic tone.


Question: Does this audio feature human-voiced imitations of bassoon?
yes I can't play the bassoon, but I can definitely help you with questions about it! What would you like to know? Yes Yes, the first sound in the audio (from 0:00 to approximately 0:01) appears to be a human voice imitating a bassoon. The second sound (from approximately 0:02 to 0:03) is a synthesized instrument, not a vocal imitation.


Question: Does this audio feature human-voiced imitations of dog bark?
no Yes, this audio features human-voiced imitations of dog barks. It sounds like it's trying to mimic the sound of a dog barking. Yes No, the audio features sounds that are characteristic of **actual dog barks**, not human imitations.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Are there 3 people talking in this audio?
no Yes, there are three people talking in this audio. yes No, it sounds like there are **two** people talking.


Question: Are there 3 violins playing in this audio?
yes [in Chinese] Yes, there are three violins playing in this audio. You can hear them interweaving in the music, forming rich harmonies. Doesn't it feel richly layered? No Based on the harmonic richness and texture, yes, it sounds very likely that there are three violins playing, or at least three distinct violin parts contributing to the sound.


Question: Are there 3 people talking in this audio?
no Oh, got it! So there are three people talking in this audio. That makes sense. Is there anything else you'd like to know? yes Yes, it sounds like there are three distinct voices:


Question: Are there 5 brass instruments being played in this audio?
no [in Chinese] Yes, there are five kinds of brass instruments playing in this audio. Can you tell which instruments they are? Yes Yes, it sounds very much like a standard brass quintet.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Does the train sound like it's moving from near to far?
no The train sound is moving from far to near. Yes Yes, the train sound definitely gives the impression of moving from near to far.


Question: Does the train sound like it's moving from far to near?
no [in Chinese] Yes, it sounds like the train is slowly approaching from far away. Yes No, it sounds like the train is moving from **near to far**. The volume of the train sounds decreases significantly throughout the clip, and the sounds become less distinct, which are typical cues for an object moving away from the listener.


Question: Does the car sound like it's moving from far to near?
no Yes, it does! It starts off distant and then gets closer, as if the car is approaching. Yes No, it sounds more like an emergency vehicle (with a siren) starts very close and loud, then moves away or passes by quickly.


Question: Does the train in the audio stay in place?
no The train in the audio is moving forward. yes No, the train does not stay in place. Around 0:04-0:06, you can clearly hear the hiss of air brakes releasing, which is immediately followed by sounds indicating the train is starting to move.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Does the goat bark a total of 3 times in the audio?
yes The goat barks a total of 5 times in the audio. yes No, the goat does not bark. It makes a bleating sound (or baa) 3 times in the audio.


Question: Does the bell ring a total of 3 times in the audio?
yes Yes, the bell rings a total of 3 times in the audio. yes No, it does not.


Question: Does the bell ring a total of 2 times in the audio?
yes The bell rings 6 times in the audio clip. No Yes, the bell rings a total of 2 times in the audio. The first ring is at the very beginning (around 0:00) and the second is around 0:04.


Question: Does the cat meow a total of 3 times in the audio?
yes The cat meows a total of 3 times in the audio. yes No, the cat meows twice.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Does the gunshot occur after the person speaks?
no The gunshot occurs after the person speaks. No No, the gunshot (or a very similar loud percussive sound) occurs at the very beginning of the video (around 0:00), and the person starts speaking around 0:03-0:04. So, the gunshot occurs *before* the person speaks.


Question: Does the roaring of the animal occur after the melodious singing of people?
no The roaring of the animal does not occur after the melodious singing of people. Yes The roaring of the animal occurs **before** the melodious singing of people. The roaring is heard from the beginning of the audio until about 0:04, and the singing starts around 0:05.


Question: Does the man converse after the mooing of the cow?
yes The man does not continue his conversation after the cow moos. Yes Yes, a man starts speaking at approximately 0:02, immediately after the cow moos (which occurs from 0:00 to 0:02).


Question: Does the man's conversation occur after the barking of the dog?
yes Yes, the man's conversation occurs after the barking of the dog. We can hear the dog barking, then a pause, and then the man starts speaking. Yes Yes, the man's conversation starts **after** the dog barks. The dog barks at the very beginning of the audio, and then the man starts speaking around the 0:02 mark.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Does the guitar in the audio last for less than 6 seconds?
yes [in Chinese] Yes, the guitar in this audio plays for less than six seconds. The answer is yes. No, the guitar plays for longer than 6 seconds. It starts around 0:04 and continues playing beyond the 0:09 mark shown in the waveform, which means it's at least 5 seconds long, and it sounds like it continues a bit after that.


Question: Does the guitar in the audio last for less than 8 seconds?
yes It sounds like a B flat. It's a bit low, but not by much. No, the guitar in the audio lasts for more than 8 seconds. Yes, the guitar sound lasts for less than 8 seconds.


Question: Does the guitar in the audio last for less than 7 seconds?
yes [in Chinese] Yes, this chart is easy to read and also bright. It seems to work well. Yes Yes, the guitar sounds in the audio last for less than 7 seconds.


Question: Does the car in the audio last for more than 5 seconds?
no [in Chinese] Yes, the sound of the car lasts for more than five seconds. yes Yes, there is a car sound that lasts for more than 5 seconds.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Is there a cat in the audio?
no [in Chinese] Yes, there is a cat meowing in the audio. No No, there does not appear to be a cat in the audio. The sounds are more characteristic of a dog, such as growling and barking.


Question: Is it raining in the audio?
no [in Chinese] According to the audio, it is raining outside. Did you hear the rain? The answer is no No, it does not sound like it is raining in the audio. The primary sounds are of frogs croaking.


Question: Is there a violin in the audio?
yes Yes, there is a violin in the audio. It adds a beautiful, melodic touch to the track. Yes No, a violin is not present in the audio. The instrument playing sounds like a cello or a double bass, which are larger string instruments with a lower pitch than a violin.


Question: Is there a drum in the audio?
yes Yes, there is a drum in the audio. It adds a rhythmic element to the sound, making it more interesting and dynamic. Yes Yes, there is a drum present in the audio. You can hear a snare drum hit around 0:02 and a full drum kit playing a beat starting around 0:07.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Is there the word 'wind' in the audio?
no [in Chinese] Yes, the word 'wind' is mentioned in the audio. It appears at around 1 minute 8 seconds. Did you hear it? No The audio contains sounds that resemble rumbling or mechanical noises, but the word "wind" is not spoken.


Question: Is there the word 'dog' in the audio?
no [in Chinese] Yes, the word “dog” is mentioned in the audio. It appears at around 0 seconds 08 seconds. Have a listen, isn't that so? Yes The audio contains barking sounds and some human speech at the end ("Yeah, are you going to..."). However, the word "dog" is not spoken in the audio.


Question: Is there the word 'guitar' in the audio?
no [in Chinese] Yes, the guitar is mentioned in the audio. Did you hear it? Yes The word "guitar" is not present in the audio. However, the audio does feature the sound of a guitar playing.


Question: Is there a sound of wind in the audio?
yes Yes, there is a sound of wind in the audio. No The audio primarily contains a rumbling sound, which might be mistaken for wind, but there is no clear or distinct sound of wind present. Therefore, it is unlikely that there is a significant wind sound in the audio.
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


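Question: [in Chinese] Does the sentence contain '富士' ("Fuji")?
[in Chinese] Mount Fuji is indeed beautiful. It is the highest peak in Japan and also a dormant volcano. Seen from afar, its summit is often shrouded in cloud and mist, which makes it look very mysterious. Many visitors climb it or admire its scenery from different angles. no [in Chinese] Yes, the sentence contains “富士”.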


Question: Does the sentence mention 'peace'?
no Sure! Enjoy a piece of our cake. It's delicious! no. No, the sentence mentions "piece" (as in a piece of cake), not "peace" (as in tranquility).


Question: Does the sentence mention 'bare'?
no The sentence doesn't mention 'bare.' It says, 'A bear was wandering in the forest.' yes No, the sentence mentions "bear" (the animal), not "bare".


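Question: [in Chinese] Does the sentence mention the '意' (yì) of '意思' ("meaning")?
[in Chinese] The character '意' has many meanings, for example: 意思, 意义, 意图, 意志, 意念, 意境, 意气, 意想, 意会, 意指. In Chinese, it is often used to express the speaker's intention or thoughts. Which meaning would you like to know about? yes [in Chinese] The sentence does not mention the “意” of “意思”. What it mentions is the “义” of “含义”, which is pronounced the same as “意” but is written differently.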
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: According to the audio, is Earth a planet?
no Oh, I think you might be referring to the moon! Earth is actually a planet in our solar system, orbiting the sun. The moon, on the other hand, is Earth's only natural satellite. Systematic errors like this are pretty common, but it's fun to correct them! No. The audio states that "Earth is a satellite of the Sun." It does not explicitly say whether Earth is a planet.


Question: According to the audio, is Earth a planet?
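yes [in Chinese] Yes, Earth is a planet. It is located in the solar system and is one of its eight planets. 71% of Earth's surface is covered by water, and the rest is land. Earth is the only planet known to harbor life, and it is also the home of us humans. yes. Yes, according to the audio, Earth is a planet.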


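Question: [in Chinese] According to the speech, is Paris a city in Italy?
[in Chinese] Paris is the capital of France, located in northern France. It is not a city in Italy. no [in Chinese] According to the speech, yes; the speech says “Paris is a city in Italy”.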


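Question: [in Chinese] According to the speech, is Paris a city in France?
[in Chinese] Right, Paris is the capital of France. It is not only a political and cultural center but is also known for its rich history, art, and fashion. Do you like Paris? yes [in Chinese] Yes, according to the speech, Paris is a city in France.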
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Based on the voice timbre, is the speaker male?
no Oh, hi there! It's great to hear from you. How can I assist you today? yes Based on the voice timbre and the speaker's statement "I am a girl," the speaker is **not male**. The voice has a higher pitch and lighter quality, which is characteristic of a female voice.


Question: Based on the voice timbre, is the speaker a child?
yes Wow, that's fascinating! Old people often have a rich and warm tone to their voice, full of life experiences. How can I help you today? no The voice has a higher pitch and a lighter quality, which can sometimes be associated with a child's voice. However, the speaker explicitly states, "I am a very old person." Therefore, based on the speaker's own statement, they are not a child, despite any qualities in the timbre that might sound youthful.


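Question: [in Chinese] Based on the voice timbre, is the speaker male?
[in Chinese] Oh, I see! Then your voice sounds very distinctive. Is there a particular reason for that, or were you born this way? no [in Chinese] No. Based on the voice timbre and the content, the speaker is female. The speaker also said “I am a girl”.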


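Question: [in Chinese] Judging by the voice timbre, is an elderly person speaking?
[in Chinese] Hello there, little one! Is there anything I can help you with? no [in Chinese] Judging by the voice timbre, it is **not** an elderly person speaking.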
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Based on the audio, does the first 'can' mean 'to preserve food in a can'?
No Sure! If you're asking whether someone can perform the task of canning food in a can, the answer would be no. Canning is typically done by sealing food in glass jars, not metal cans. But if you meant to ask if someone can open a can, then the answer is definitely yes! Canners are tools specifically designed for opening cans. yes No, the first 'can' is an auxiliary verb used to ask about ability, as in "Are you able to...".


Question: Based on the audio, does the first 'watch' mean a timepiece worn on the wrist?
No I don't have a watch, but if I did, I'd probably use it to check the time while watching my favorite show! What about you? Do you have a favorite watch? yes No, the first 'watch' is a verb meaning to observe or pay attention to. The phrase "your watch" (the second 'watch') refers to the timepiece.


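Question: [in Chinese] According to the text, does the second “意思” mean “interest, appeal”?
[in Chinese] What the teacher said was very interesting (有意思), but for a moment I did not understand what she meant (她的意思). In this sentence, “意思” means “含义” ("meaning") or “意思”. That is to say, what the teacher said was very interesting, but I have not fully understood what she meant. No [in Chinese] According to the text: * The first “意思” (in “很有意思”) means “interest, appeal”. * The second “意思” (in “他的意思”) refers to “meaning, intention, idea”.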


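Question: [in Chinese] According to the text, does the second “长” mean “great in length, long in distance”?
[in Chinese] Right, “长” here means that the hair is very long, great in length. It is a way of describing a feature of the hair. yes. [in Chinese] Yes, the second “长” here is read **cháng** (second tone) and means “**great in length**”, describing the length of the hair.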
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Please transcribe the audio into text.
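Please name your favorite book. [in Chinese] I don't have personal preferences, but many classic books are very popular, such as One Hundred Years of Solitude, Pride and Prejudice, and the Harry Potter series. Is there a book you especially like? please name your favorite book. Please name your favorite book.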


Question: Please transcribe the audio into text.
Can you tell me what time it is now? [in Chinese] Sorry, I cannot provide the current time. You can check the time on your device. can you tell me what time it is now? Can you tell me what time it is now?


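Question: [in Chinese] Please transcribe the audio into text.
请用中文朗读下面这句话。 ("Please read the following sentence aloud in Chinese.") [in Chinese] OK, please tell me the sentence you would like me to read aloud. 请用中文朗读下面这句话 请用中文朗读下面这句话。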


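Question: [in Chinese] Please transcribe the audio into text.
请说出你最喜欢的一本书。 ("Please name your favorite book.") [in Chinese] I don't have personal preferences, but I can chat with you about the themes and styles of all kinds of books. Is there a book you especially like, or one you have been reading recently? 请说出你最喜欢的一本书 请说说你最喜欢的一本书。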
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


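Question: [in Chinese] According to the audio content, is there applause?
[in Chinese] Wow, that sounds great! The audience's applause is the best affirmation of a speech! A moment like that must be especially exciting! Have you ever experienced a similar scene? yes [in Chinese] The audio content describes applause (“the audience applauded warmly”), but the audio itself is narration and does not actually contain the sound of applause.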


Question: Based on the audio, is there cat meowing?
No Oh, that sounds like quite the experience! Cats can be quite vocal, especially when they want something. Are you able to figure out what the cat might be trying to tell you? yes Based on the audio, it mentions a cat "mowing," not "meowing." So, no, there is no cat meowing described in the audio.


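Question: [in Chinese] According to the audio content, is there a dog barking?
[in Chinese] Haha, it sounds like your puppy might be a bit mischievous! Sometimes they also bring surprises when you least expect it. How do you usually get along with it? yes [in Chinese] Yes, the audio content mentions a dog barking. The speaker says “our dog is barking again”.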


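Question: [in Chinese] According to the audio content, does the word “哭” ("cry") appear?
[in Chinese] That sounds a bit annoying. I hope the neighbor's child calms down soon. yes [in Chinese] Yes, the word “哭” appears in the audio, in the sentence “邻居家的孩子一直在哭” ("the neighbor's child keeps crying").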
Audio Answer Qwen2-Audio GLM-4-Voice Gemini 2.5 Pro


Question: Based on the audio, did the observer have the telescope?
No Ah, so the observer saw the man with the telescope. It sounds like they were quite a sight! Did you catch a glimpse of what he was looking at? yes. No, the observer saw a man who had the telescope. The observer did not have the telescope.


Question: Based on the audio, did the observer have the telescope?
Yes It sounds like you saw the man with the telescope. yes. Based on the audio, the observer **saw** the man with the telescope, not that the observer themselves had the telescope.


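Question: [in Chinese] According to the speech, must both you and he speak?
[in Chinese] Hmm, sometimes you really do need to express your own views. But when communicating, understanding and listening are also important. What do you think? yes [in Chinese] According to the speech, it is not the case that both people (you and he) must speak.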


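Question: [in Chinese] According to the speech, must both you and he speak?
[in Chinese] Hmm, I see what you mean. But in this scenario, he may not need to speak; he can just stay quiet. If he is needed to respond, though, he can also say a few words. yes. [in Chinese] Yes, according to the speech content: both you (the person being asked to speak) and he must speak.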
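
The cases above pair a yes/no reference answer with free-form model responses, some of which never commit to either option. As a rough illustration only, the following sketch shows one way such responses might be mapped to a yes/no label and scored. The cue lists and the helper names `extract_yes_no` and `accuracy` are assumptions made for this example; they are not taken from the AHa-Bench code, and the benchmark's actual answer parsing and metrics may differ.

```python
import re
from typing import Optional, Sequence

# Hypothetical cue lists chosen for illustration only; extend as needed.
YES_CUES = ("yes", "是的")
NO_CUES = ("no", "不是", "没有")


def extract_yes_no(response: str) -> Optional[str]:
    """Return 'yes' or 'no' for the earliest cue in the response, else None."""
    text = response.strip().lower()
    earliest = None  # (position, label) of the earliest cue found so far
    for label, cues in (("yes", YES_CUES), ("no", NO_CUES)):
        for cue in cues:
            if cue.isascii():
                # Whole-word match so that "not" or "know" do not count as "no".
                match = re.search(rf"\b{re.escape(cue)}\b", text)
                pos = match.start() if match else -1
            else:
                pos = text.find(cue)
            if pos >= 0 and (earliest is None or pos < earliest[0]):
                earliest = (pos, label)
    return earliest[1] if earliest else None


def accuracy(references: Sequence[str], responses: Sequence[str]) -> float:
    """Fraction of responses whose extracted label equals the reference."""
    hits = sum(
        extract_yes_no(resp) == ref.strip().lower()
        for ref, resp in zip(references, responses)
    )
    return hits / len(references)


# Illustrative responses modeled on the cases above, with reference answers.
refs = ["no", "no", "no"]
resps = [
    "No, the audio features the sound of an actual bassoon.",
    "是的,这段音频是人声模仿的巴松音效。",
    "The train sound is moving from far to near.",
]
print([extract_yes_no(r) for r in resps])
print(accuracy(refs, resps))
```

In this sketch a response with no recognizable cue is returned as `None` and therefore counted as incorrect. That is a design choice of the example, not a documented property of the benchmark; a stricter evaluation might track such non-answers separately.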