Recent Entries 10
- pattern minor 112d agoPerturbing a Markov chain to be closer to a target stationary distributionSuppose we are a given a Markov chain $A_0 \in \mathbb{R}^{n \times n}$ and a desired stationary probability vector $\pi_0 \in \mathbb{R}^n$. I would like to find a Markov chain that is as close as possible to $A_0$ and whose stationary probability is as close as possible to $\pi_0$. For example, I would like to solve $$ \min_{{\rm stochastic}~ A} ||A-A_0||_F + ||\pi(A)-\pi_0||_2$$ Is this doable in polynomial time? For example, for any fixed $\epsilon > 0$, can I find something $\epsilon$-close to the optimal solution in time polynomial in $n$? If not, what sort of approximation guarantees can we obtain for this? I am pretty flexible on the problem statement; ultimately, the solution will be validated based on performance on a particular application. So if the problem is solvable by replacing the $2$- and Frobenius norms with something else, or via formalizing the problem in some other way, that would also be perfect.
- pattern minor 112d agoHow many possible policies in a Markov Decision Process?If a policy yields an action for a state, how come a 3-state MDP with 2 possible actions, i.e. $S = \{Hot, Mild, Cold\}$, $A = \{East, West\}$, has 8 possible policies? Isn't it 6 if there are 2 possible action for every state?
- principle minor 112d agoAbsorbing Markov Chains: An efficient algorithmic approachFollowing this procedure I have successfully written a program to calculate the probability of ending in a given absorbing state given the initial state. The procedure is as follows: - Given the transition matrix (P), row and column swap until the identity matrix is in the bottom right corner. $$ P = \begin{bmatrix} Q & R \\ 0 & I \end{bmatrix} $$ - Calculate the matrix N, and multiply by R to give the final probability matrix (B). $$N = (I-Q)^{-1}$$ $$B = N*R$$ Where the values in B represent the probability of moving from an initial non-absorbing state (rows) to a final absorbing state. My question is if there is a more efficient method for solving this problem if I know what my initial state is. It seems wasteful to calculate the entire B matrix, given that I will only ever use one row in it. I am writing a program to do this, so the matrix inversion step is particularly inefficient. Can I avoid this altogether?
- pattern minor 112d agoThe transition function in a Markov decision processA Markov decision process is typically described as a tuple $\langle A,U,T,R \rangle $ where - $A$ is the state space - $U$ is the action space - $T: A \times U \times A \mapsto [0,\infty) $ is the state transition probability function - $R:A \times U \times A \mapsto \mathbb{R}$ is the reward function What does this $A \times U \times A$ actually mean in terms of the MDP? It is written in all the papers, but never explained. Does it mean that all the states $a \in A$ are multiplied with all the action $u \in U$? Or something completely different?
- pattern minor 112d agoClarification of the definition of a POMDPFrom what I understand, a $MDP=(G, A, P, R)$ (markov decision process) is represented as: - A complete directed graph $G=(V, E)$ - A set of actions $A_u$ for each vertex $u \in V$ - A reward function $R$ that maps any vertex to some reward, i.e., $R \colon V \mapsto \mathbb{R}$. - A probability function $P$ that gives the probability $P_a(u, v)\in [0, 1]$ of taking edge $(u, v)$ after performing action $a \in A_u$ at node $u$ The behavior of any MDP is talked about in terms of some policy $F$, where $F_0$ is the initial state that the policy starts in, and $F(u)\in A_u$ is the action $F$ takes when at node $u$, which is defined for all $u \in V$. We then say that the MDP starts in $u=F_0$, performs some action $a=F(u)$, then moves to some other vertex $v \in V$ with probability $P_a(u, v)$. It repeats this process, where on turn $t$ it performs an action $b=F(w)$ some node $w \in V$, then moves to some other node $z\in V$ according to probability $P_b(w, z)$. Now, a POMDP (partially observable markov decision process) is defined in a similar manner, except at each state $u \in V$ instead of knowing it's current state, $F$ is given some observation data $O(u) \in S$, where $S$ is any set. $F$ then constructs a "belief state" $b \in [0, 1]^{\vert V \vert }$ based on that information, which is a set of probabilities associated with each node of how much it "believes" it is in that node. Now we define $F$ in terms of what action it performs in a given belief state, where the belief state can change with time. This all makes sense, my question it just that since there are a different set of actions for each node, how does $F$ know what actions $A_u$ it has available without knowing what node $u\in V$ it is in?
- pattern minor 112d agoCan the solution to a POMDP be found using linear programming?It is known that Markov decision processes (MDPs) can be solved using linear programming (see page 24 of Carlos Guestrin's PhD dissertation). The linear program is: $$min_{V(x)} \sum_x \alpha(x)V(x)\\ \text{subject to: } V(x) \ge R(x,a) + \gamma\sum_{x'}P(x'|x,a)V(x')\text{ for all }x\in X, a\in A$$ where $V(x)$ is the value of starting at state $x$ (note that $V(x)$ is the decision variable), parameters $\alpha(x)>0$ for all $x$ which are termed state relevance weights and are typically normalized, that is $\sum_x\alpha(x)=1$ (the solution of the above LP is actually, surprisingly, independent of the exact choice of $\alpha$), $R(x,a)$ is the reward obtained in state $x$ when action $a$ is taken, and $P(x'|x,a)$ is the probability that transition to state $x'$ when we take action $a$ in state $x$. Question: does anyone know of a similar linear programming formulation for solving partially observable Markov decision processes (POMDPs)?
- principle minor 112d agoAverage vs Worst-Case Hitting TimeConsider a simple random walk on an undirected graph and let $H_{ij}$ be the hitting time from $i$ to $j$. How much bigger can $$ H_{\rm max} = \max_{i,j} H_{ij}, $$ be compared to $$ H_{\rm ave} = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n H_{ij}.$$ For all the examples I can think of, these two quantities are of roughly the same order of magnitude. To make this into a formal question, define $$\phi(n) = \max_{\mbox{undirected graphs with } n \mbox{ nodes }} \frac{H_{\rm max}}{H_{\rm ave}}.$$ How fast does $\phi(n)$ grow with $n$?
- pattern minor 112d agoSeemingly non sequitur in proofI'm trying to understand a small proof in an article about computing lumpability on Markov chains. There is a small detail that I cannot understand, i.e. I don't think it follows from the argument. The article is Simple $O(m\log{n})$ Time Markov Lumping written by Valmari & Franceschinis. Given a weighted graph they define the notion of compatible partition of its nodes. Which is a partition $P$ such that for all $B, B' \in P$ and all $s_1, s_2 \in B$ we have that: $$ W(s_1, B') = W(s_2, B') $$ They describe an algorithm to compute the coarsest refinement of an initial partition that is compatible (the coarsest croip= compatible refinement of initial partition). The algorithm follows the idea of Paige and Tarjan to compute bisimulation: we take a block $S$ from the partition as a splitter and check whether some other block $B$ breaks the condition above, and if it does we split $B$ into $B_1, \ldots, B_k$ such that the $B_i$ satisfy the condition above when taking $B' = S$. How exactly these checks are performed shouldn't be essential for the correctness of the algorithm. In the proof of correctness they prove the following lemma: Lemma 2. Let $d_1, d_2$ be nodes of the graph. If the algorithm ever puts $d_1$ and $d_2$ into different blocks, then there is no croip where $d_1$ and $d_2$ are in the same block. The proof for when the algorithm separates $d_1$ and $d_2$ during its execution is the following: Proof. We show that it is an invariant property of the main loop of the algorithm (that is, always valid on line 2*) that if two states are in different blocks of the algorithm, then they are in different blocks in every croip. If $d_1$ and $d_2$ are in different blocks initially, then they are in different blocks in $I$ and thus in every croip. The case remains where lines 14 to 20 separate $d_1$ and $d_2$ to different blocks. This happens only if $W(d_1, S) \neq W(d_2, S)$. Let $P$ be an arbitrary croip. It follows from
- pattern moderate 112d agoWhat are Markov chains?I'm currently reading some papers about Markov chain lumping and I'm failing to see the difference between a Markov chain and a plain directed weighted graph. For example in the article Optimal state-space lumping in Markov chains they provide the following definition of a CTMC (continuous time Markov chain): We consider a finite CTMC $(\mathcal{S}, Q)$ with state space $\mathcal{S} = \{x_1, x_2, \ldots, x_n\}$ by a transition rate matrix $Q: \mathcal{S} \times \mathcal{S} \to \mathbb{R}^+$. They don't mention the Markov property at all, and, in fact, if the weight on the edges represents a probability I believe the Markov property trivially holds since the probability depends only on the current state of the chain and not the path that lead to it. In an other article On Relational Properties of Lumpability Markov chains are defined similarly: A Markov chain $M$ will be represented as a triplet $(S, P, \pi)$ where $S$ is the finite set of states of $M$, $P$ the transition probability matrix indicating the probability of getting from one state to another, and $\pi$ is the initial probability distribution representing the likelyhood for the system to start in a certain state. Again, no mention of past or future or independence. There's a third paper Simple O(m logn) Time Markov Chain Lumping where they not only never state that the weights on the edges are probabilities, but they even say: In many applications, the values $W(s, s')$ are non-negative. We do not make this assumption, however, because there are also applications where $W(s, s)$ is deliberately chosen as $-W(s, S \setminus \{s\})$, making it usually negative. Moreover, it's stated that lumping should be a way to reduce the number of states while maintaining the Markov property (by aggregating "equivalent" state into a bigger state). Yet, to me, it looks like it's simply summing probabilities and it shouldn't even guarantee that the resulting peobabilities of the trans
- pattern minor 112d agoWhy are HMMs appropriate for speech recognition when the problem doesn't seem to satisfy the Markov propertyI'm learning about HMMs and their applications and trying to understand their usages. My knowledge is a bit spotty, so please correct any incorrect assumptions I'm making. The specific example I'm wondering about is for using HMMs for speech detection, which is a common example in literature. The basic method seems to be to treat the incoming sounds (after processing) as observations, where the actual words being spoken are the hidden states of the process. It seems obvious the hidden variables here are not independent, but I do not understand how they satisfy the Markov property. I would imagine that the probability of the Nth word is not just dependent on the N-1 word, but on many preceding words before that. Is this simply ignored as a simplifying assumptions because HMMs are very good at correctly modeling speech detection problems, or am I not clearly understanding what the states and hidden variables in the process are? The same problem would appear to apply to a great deal of applications in which HMMs are quite popular, POS tagging, and so forth.