Why are HMMs appropriate for speech recognition when the problem doesn't seem to satisfy the Markov property?
Problem
I'm learning about HMMs and their applications and trying to understand their uses. My knowledge is a bit spotty, so please correct any incorrect assumptions I'm making. The specific example I'm wondering about is using HMMs for speech recognition, which is a common example in the literature.
The basic method seems to be to treat the incoming sounds (after processing) as observations, with the actual words being spoken as the hidden states of the process. It seems obvious the hidden variables here are not independent, but I do not understand how they satisfy the Markov property. I would imagine that the probability of the Nth word depends not just on the (N-1)th word, but on many preceding words before that.
Is this simply ignored as a simplifying assumption because HMMs model speech recognition well in practice, or am I not clearly understanding what the states and hidden variables in the process are? The same objection would seem to apply to many other applications in which HMMs are popular: POS tagging, and so forth.
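To make the setup described above concrete, here is a deliberately tiny sketch of that decoding view: hidden states are words, observations are discretized "acoustic" symbols, and Viterbi decoding recovers the most likely word sequence. All words, symbols, and probabilities here are invented for illustration; this is not a real recognizer.

```python
import numpy as np

states = ["hello", "world"]          # hidden word states
obs_symbols = ["h", "w", "l"]        # toy "acoustic" observation symbols

# P(next word | current word): first-order Markov transition matrix
trans = np.array([[0.3, 0.7],
                  [0.6, 0.4]])
# P(observation | word): emission probabilities (rows follow `states`)
emit = np.array([[0.6, 0.1, 0.3],    # "hello" tends to emit "h"/"l"
                 [0.1, 0.7, 0.2]])   # "world" tends to emit "w"
start = np.array([0.5, 0.5])

def viterbi(obs):
    """Most likely hidden word sequence for a list of observation indices."""
    T, N = len(obs), len(states)
    logp = np.log(start) + np.log(emit[:, obs[0]])
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # scores[i, j] = best log-prob of ending in state j coming from state i
        scores = logp[:, None] + np.log(trans) + np.log(emit[:, obs[t]])
        back[t] = scores.argmax(axis=0)
        logp = scores.max(axis=0)
    # Backtrack from the best final state
    path = [int(logp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[i] for i in reversed(path)]

obs = [0, 2, 1]                      # "h", "l", "w"
print(viterbi(obs))                  # → ['hello', 'hello', 'world']
```

Note that the first-order transition matrix is exactly the assumption the question is asking about: the model only conditions on the immediately preceding state.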
Solution
On that subject, I recommend reading a very good paper by James Baker and others, who were actually responsible for the introduction of HMMs into speech recognition:
A Historical Perspective of Speech Recognition
http://cacm.acm.org/magazines/2014/1/170863-a-historical-perspective-of-speech-recognition/abstract
Using Markov models to represent language knowledge was controversial.
Linguists knew no natural language could be represented even by
context-free grammar, much less by a finite state grammar. Similarly,
artificial intelligence experts were more doubtful that a model as
simple as a Markov process would be useful for representing the
higher-level knowledge sources recommended in the Newell report.
However, there is a fundamental difference between assuming that language
itself is a Markov process and modeling language as a
probabilistic function of a hidden Markov process. The latter model is
an approximation method that does not make an assumption about
language, but rather provides a prescription to the designer in
choosing what to represent in the hidden process. The definitive
property of a Markov process is that, given the current state,
probabilities of future events will be independent of any additional
information about the past history of the process. This property
means if there is any information about the past history of the observed
process (such as the observed words and sub-word units), then
the designer should encode that information with distinct states in
the hidden process. It turned out that each of the levels of the
Newell hierarchy could be represented as a probabilistic function of
a hidden Markov process to a reasonable level of approximation. For
today’s state-of-the-art language modeling, most systems still use
the statistical N-gram language models and the variants, trained with
the basic counting or EM-style techniques. These models have proved
remarkably powerful and resilient. However, the N-gram is a highly
simplistic model for realistic human language. In a similar manner
with deep learning for significantly improving acoustic modeling
quality, recurrent neural networks have also significantly
improved the N-gram language model. It is worth noting that
nothing beats a massive text corpora matching the application domain
for most real speech applications.
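The "basic counting" training that the passage mentions for statistical N-gram models can be sketched in a few lines: a maximum-likelihood bigram model estimates P(w | prev) as count(prev, w) / count(prev). The corpus below is a made-up toy; real systems also apply smoothing for unseen bigrams.

```python
from collections import Counter, defaultdict

# Toy corpus (invented); real models train on massive domain-matched text.
corpus = "the cat sat on the mat the cat ran".split()

# count(prev, w): how often word w follows word prev
bigram_counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    bigram_counts[prev][cur] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev), no smoothing."""
    total = sum(bigram_counts[prev].values())
    if total == 0:
        return 0.0                   # unseen history
    return bigram_counts[prev][word] / total

print(bigram_prob("the", "cat"))     # 2 of the 3 follow-ups of "the" are "cat"
```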
Overall, the Markov model is a fairly generic model for decoding a black-box channel, with very relaxed assumptions about the transmission, so it is a good fit for speech recognition. The question remains, however, what to encode as a state. It is clear that states should be more complex objects than what we assume now (just a few preceding words). Revealing the true nature of that structure is ongoing research.
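The prescription in the quoted passage, encoding relevant past history as distinct hidden states, can be illustrated mechanically: any k-th-order dependency over words becomes first-order if the hidden states are k-tuples of words. A minimal sketch with an invented vocabulary and invented second-order probabilities:

```python
from itertools import product

words = ["the", "cat", "sat"]

# Hypothetical second-order probabilities P(w_t | w_{t-2}, w_{t-1});
# values are invented for illustration. Missing histories fall back
# to a uniform distribution below.
second_order = {
    ("the", "cat"): {"sat": 0.9, "the": 0.05, "cat": 0.05},
}

def first_order_transition(state_a, state_b):
    """Transition prob between pair-states; nonzero only when they overlap.

    state_a = (w_{t-2}, w_{t-1}) and state_b = (w_{t-1}, w_t), so a valid
    transition requires state_a[1] == state_b[0].
    """
    if state_a[1] != state_b[0]:
        return 0.0
    row = second_order.get(state_a)
    if row is None:
        return 1.0 / len(words)      # uniform fallback for unseen histories
    return row.get(state_b[1], 0.0)

# The expanded first-order state space: all word pairs.
pair_states = list(product(words, repeat=2))
print(len(pair_states))              # 9 pair-states encode a 2-word history

print(first_order_transition(("the", "cat"), ("cat", "sat")))  # 0.9
print(first_order_transition(("the", "cat"), ("the", "sat")))  # 0.0, no overlap
```

Because consecutive pair-states chain together like an ordinary Markov chain, standard HMM algorithms (Viterbi, EM) apply unchanged; the cost is a state space that grows exponentially in the history length, which is one reason the choice of what to encode remains an open design problem.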
Context
StackExchange Computer Science Q#37709, answer score: 8