Algorithmic and Information-Theoretic Decoding Thresholds for Low-Density Parity-Check Codes
by Jeremy Dohmann, Vanessa Wong, Venkat Arun
Abstract
We will discuss error-correcting codes: specifically, low-density parity-check (LDPC) codes. We first describe their construction and their information-theoretic decoding threshold $\epsilon_c$.
Belief propagation (BP) (see Tom's notes) can be used to decode these. We analyze BP to find the maximum error rate $\epsilon_d$ up to which BP succeeds.
After this point, BP will reach a suboptimal fixed point with high probability. This threshold is lower than the information-theoretic bound, illustrating a gap between algorithmic and information-theoretic thresholds for decoding.
Then we outline a proof of a theorem suggesting that any efficient algorithm, not just BP, will fail after the algorithmic threshold. This is because there is a phase transition at the algorithmic threshold, after which there exists an exponentially large number of suboptimal `metastable' states near the optimal solution. Local search algorithms will tend to get stuck at these suboptimal points.
Introduction to Error Correcting Codes
Motivation
Alice wants to send a message to Bob, but their channel of communication is such that Bob receives a corrupted version of what Alice sent. Most practical communication devices are imperfect and introduce errors in the messages they are transmitting. For instance, if Alice sends 3V, Bob will really receive three volts plus some noise (we have chosen to ignore some inconvenient practical details here). In many cases, this noise is quite small, e.g. it could be less than 0.5V in 99.99% of cases. So, in principle, Alice could have had very reliable delivery by just choosing to always send 0V for a logical 0 and 5V for logical 1, using checksums to detect the occasional error. But this is wasteful. Alice could have squeezed more levels between 0 and 5V to get a higher bitrate. This causes errors, but Alice can introduce redundancy in the bits she is transmitting which can enable Bob to decode the correct message with high probability. Since it is much easier to control redundancy in encoding than in physical quantities, practical communication devices often choose to pack enough bits into their physical signal that errors are relatively quite likely, relying instead on redundancy in their encoding to recover from the errors. Redundancy is also used in storage, where we don’t have the option of detecting an error and retransmitting the message.
Some errors in communication are caused by thermal noise. These are unpredictable and unavoidable, but the errors they cause can be easily modeled; they cause bit-flips in random positions in the message. There are other sources of error. The clocks on the two devices may not be correctly synchronized, causing systematic bit-flips in a somewhat predictable pattern. A sudden surge of current (e.g. because someone turned on the lights, and electricity sparked between the contacts) can corrupt a large contiguous segment of bits. Or the cable could simply be cut in which case no information gets through. These other kinds of errors are often harder to model (and easier to detect/mitigate), so we have to remain content with merely detecting them. Thus for the remainder of this blog post, we shall restrict ourselves to an error model where each bit is corrupted with some fixed (but potentially unknown) probability, independent of the other bits. For simplicity, we shall primarily consider the Binary Erasure Channel (BEC), where a bit either goes through successfully, or the receiver knows that there has been an error (though we will introduce some related channels along the way).
Claude Shannon found that given any channel, there is a bitrate below which it is possible to communicate reliably with vanishing error rate. Reliable communication cannot be achieved above this bitrate. Hence this threshold bitrate is called the channel capacity. He showed that random linear codes are an optimal encoding scheme that achieves channel capacity. We will only briefly discuss random linear codes, but essentially they work by choosing random vectors in the input space and mapping them randomly to vectors in the encoded space. Unfortunately we do not have efficient algorithms for decoding these codes (mostly due to the randomness in their construction), and it is conjectured that one doesn’t exist. Recently Low-Density Parity Check (LDPC) codes have gained in popularity. They are simple to construct, and can be efficiently decoded at error levels quite close to the theoretical limits.
With LDPC codes, there are three limits of interest for any given channel and design bitrate: 1) the error level $\epsilon_d$ up to which an algorithm can efficiently decode them, 2) the error level $\epsilon_c$ up to which they can be decoded by a computationally unbounded decoder, and 3) the error level $\epsilon_{\rm Shannon}$ beyond which no encoding scheme can achieve reliable communication. Obviously, $\epsilon_d \le \epsilon_c \le \epsilon_{\rm Shannon}$, and in general these inequalities can be strict. Our goal here is to study the gap between $\epsilon_d$ and $\epsilon_c$. We will sometimes refer to these three quantities as $p_d$, $p_c$, and $p_{\rm Shannon}$ when discussing channels besides the BEC. This follows the notation of Mezard and Montanari (2009).
More formally, information theory concerns reliable communication via an unreliable channel. To mitigate the errors in message transmission, error correcting codes introduce some type of systematic redundancy in the transmitted message. An encoding map is applied to the information sequence to get the encoded message that is transmitted through the channel. The decoding map, on the other hand, is applied to the noisy channel output (see Figure below). Each encoded message comprises the original information bits together with redundant bits added by the error correcting code. The possible codewords form a "codebook", a subset of the binary space $\{0,1\}^N$.
Claude Shannon's analysis of random code ensembles showed that it is easier to construct stochastic code models (which have good properties with high probability) than rigid deterministic code designs. Such stochastic models achieve optimal error-correcting performance, proving that it is possible to communicate with a vanishing error probability as long as the rate of transmission is smaller than the channel capacity, a measure of the maximum mutual information between channel input and output.
Fig 1 Schematic of the communication model.
Thus, in order to construct an optimal error correcting code, one must first define a subset of the space of encoding maps, endow that set with a probability distribution, and subsequently define the associated decoding map for each of the encoding maps in the ensemble. We have included a section in Appendix A that gives a thorough discussion of random code ensembles, which are known to achieve optimal decoding, whether decoding success is scored by bit error rate or by decoded-word error rate. We will also show a perspective which uses principles from statistical physics to unify the two (often called finite-temperature decoding). From here on out we will discuss LDPC codes and explore how the values of $\epsilon_d$, $\epsilon_c$, and $\epsilon_{\rm Shannon}$ for various channels reveal deep things about the structure of the decoding solution space.
Low-Density Parity-Check Codes
LDPC codes are linear, theoretically excellent error-correcting codes that communicate at a rate close to the Shannon capacity. The LDPC codebook is a linear subspace of $\{0,1\}^N$. For an $M \times N$ sparse matrix $\mathbb{H}$, the codebook is defined as the kernel:
$\mathcal{C} = \{\underline{x} \in \{0,1\}^N : \mathbb{H}\underline{x} = \underline{0}\}$
where all the multiplications and sums involved in $\mathbb{H}\underline{x}$ are computed modulo 2.
The matrix $\mathbb{H}$ is called the parity check matrix, and the size of the codebook is $2^{N-M}$ (when $\mathbb{H}$ has full rank). Given this code, encoding is a linear operation using an $N \times (N-M)$ binary generating matrix $\mathbb{G}$ (the codebook is the image of $\mathbb{G}$), such that $\underline{x} = \mathbb{G}\,\underline{z}$ for an information word $\underline{z}$.
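To make these definitions concrete, here is a minimal sketch in Python (the 3x6 matrix below is a made-up toy example, not drawn from an LDPC ensemble) showing how codebook membership is checked with arithmetic modulo 2, and that the codebook size is $2^{N-M}$ when $\mathbb{H}$ has full rank:

```python
import numpy as np
from itertools import product

# A toy 3x6 parity-check matrix H (hypothetical example, not from a real LDPC ensemble).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]], dtype=np.uint8)

def is_codeword(H, x):
    """x is in the codebook iff every parity check is satisfied, i.e. Hx = 0 (mod 2)."""
    return not np.any(H @ x % 2)

# Brute-force enumeration of the codebook (only sensible for tiny N).
codebook = [x for x in product([0, 1], repeat=H.shape[1])
            if is_codeword(H, np.array(x, dtype=np.uint8))]
print(len(codebook))   # 2**(N - M) = 8 here, since this H has full rank
```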
Every coding scheme has three essential properties that determine its utility: the geometry of its codebook and the way it sparsely distributes proper codewords within the encoding space (c.f. our mention of sphere packing as a geometric analogy for RLCs), the ease with which one can construct a code which sparsely distributes codes within the encoding space, and the existence of fast algorithms to perform effective decoding.
A coding scheme over a given channel (whether it be BSC, BEC, AWGN, etc.) also has three parameters of interest: $\epsilon_d$, the error rate above which some chosen algorithm cannot perform error-free decoding; $\epsilon_c$, the error rate above which even exhaustively enumerating all codewords in the codebook and calculating the MAP probability does not successfully decode; and $\epsilon_{\rm Shannon}$, the capacity of the channel, an error rate above which no decoding scheme can perform error-free decoding.
LDPC codebook geometry and $\epsilon_c$
On the subject of codebook geometry, it is well known that LDPC ensembles, in expectation, produce sparsely distributed codewords. This means that valid codewords (i.e. those that pass all parity checks) are far apart from one another in Hamming space, and thus a relatively large number of bits must be lost for one codeword to degenerate into another one. There is an important property called the distance enumerator which determines the expected number of codewords in a tight neighborhood of any given codeword. If for a given distance the expected number is exponentially small, then the coding scheme is robust up to error rates causing that degree of distortion. We discuss a proof of LDPC distance properties in Appendix B and simply state here that LDPCs are good at sparsely distributing valid codewords within the encoding space. The property $\epsilon_c$ introduced above is intimately related to the geometry of the codebook and the properties of the noisy channel being decoded over.
The information-theoretic threshold $\epsilon_c$ is the noise level above which MAP decoding no longer succeeds. $\epsilon_c$ is important because below it we could theoretically always perform (slow) decoding by enumerating all codewords in the codebook and returning the one with the highest MAP probability.
Every LDPC ensemble has a different $\epsilon_c$ for a given channel. For now we will simply remark that LDPC ensembles are effective because they can be chosen such that $\epsilon_c$ closely approaches the Shannon limit for many channels. Even more importantly, we will show that it is likely that there is no fast algorithm for which $\epsilon_d = \epsilon_c$ for general LDPCs over a general noisy channel.
We will not derive $\epsilon_c$ for any LDPC ensembles over any channels (I recommend chapter 15 of this book for details), but we will, in section 3, present results derived by other researchers.
Ease of construction
On the subject of ease of construction and ease of decoding, there is a much simpler graphical representation of LDPC codes which can be used to demonstrate LDPC tractability.
LDPCs can be thought of as bipartite regular graphs, where there are N variable nodes which are connected to M parity check nodes according to randomly chosen edges based on the degree distribution of the LDPC. Though the appendix discusses general degree distributions we will discuss here only (d,k) regular bipartite graphs, in which all variables have d edges and all parity check nodes have k edges, and how to generate them under the configuration model.
The configuration model can be used to generate a bipartite graph with a (d,k) degree distribution by initially assigning all variable nodes d half-edges and all parity check nodes k half-edges, then randomly linking up half-edges between the two sets, deleting all edges which end up being paired an even number of times between the same two nodes, and collapsing multi-edges of odd multiplicity into a single edge. This procedure doesn't work perfectly, but for large N the configuration model will generate a graph for which most nodes have the proper degree. Thus it is relatively easy to generate random graphs which represent LDPC codes of any desired uniform degree distribution. An example of this graphical representation is shown in figure 2.
Fig 2 Factor graph of a (2,3) regular LDPC code with factor nodes as black squares and variable nodes as white circles, and notation for BP messages.
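As an illustration, here is a minimal sketch of the configuration model just described; the function name and the exact cleanup of multi-edges are our own choices, and a production implementation would differ in details:

```python
import random
from collections import Counter

def configuration_model_ldpc(N, d, k, seed=0):
    """Sketch of the configuration model for a (d, k)-regular LDPC factor graph.

    Each of the N variable nodes gets d half-edges, each of the M = N*d/k check
    nodes gets k half-edges, and half-edges are paired uniformly at random.
    Edge multiplicities are then reduced mod 2 (even multi-edges cancel under
    parity; odd ones collapse to a single edge), so a few nodes may end up with
    slightly lower degree than intended -- acceptable for large N.
    """
    assert (N * d) % k == 0, "need N*d divisible by k"
    M = N * d // k
    var_stubs = [i for i in range(N) for _ in range(d)]   # variable half-edges
    chk_stubs = [a for a in range(M) for _ in range(k)]   # check half-edges
    rng = random.Random(seed)
    rng.shuffle(chk_stubs)
    multiplicity = Counter(zip(var_stubs, chk_stubs))     # random pairing of stubs
    # Keep an edge only if it appears an odd number of times between a pair of nodes.
    edges = [pair for pair, m in multiplicity.items() if m % 2 == 1]
    return M, edges

M, edges = configuration_model_ldpc(N=1000, d=3, k=6)
print(M, len(edges))   # ~500 checks and close to 3000 edges for large N
```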
How the graphical model relates to fast decoding
The graphical model of LDPCs is useful because it is both easy to construct and presents a natural way to perform fast decoding. In fact, the fast graph-based decoding algorithm we use, Belief Propagation, has an $\epsilon_d$ which likely represents the upper limit on fast decoding for LDPCs.
We have seen recently that bipartite graphical models which represent a factorized probability distribution can be used to calculate marginal probabilities of individual variable nodes (what a mouthful!).
Basically, if the structure of the graph reflects some underlying probability distribution (e.g. the probability that the noisy bit $y_i$ was originally sent as bit $x_i$), then we can use an iterative algorithm called Belief Propagation (see blog post above) to actually converge to the exact marginal probability for each $x_i$.
This is important because when we perform decoding, we would like to estimate the marginal probability of each individual variable node (bit in our received vector), and simply set the variable to its most likely value (this is known as bit-MAP decoding, discussed earlier). As mentioned above, under certain conditions the Belief Propagation algorithm correctly calculates those marginal probabilities for noise rates up to an algorithmic threshold $\epsilon_d$.
The Belief Propagation algorithm is an iterative message passing algorithm in which messages are passed between variable nodes and parity check/factor nodes such that, if the messages converge to a fixed point, they encode the marginal probabilities of each variable node. Thus BP, if it succeeds, can perform bit-MAP decoding and hence successfully decode.
We will show in the next section how the configuration model graphs map to a factorized probability distribution and mention the $\epsilon_d$ for BP. In the following section we will show an example of decoding over the binary erasure channel, and finally we will give motivation to suggest that the $\epsilon_d$ for BP over LDPCs represents a hard upper limit above which no fast decoding algorithms exist.
Decoding Errors via Belief Propagation
As mentioned above (again, please see Tom’s excellent blog post for details), the belief propagation algorithm is a useful inference algorithm for stochastic models on sparse graphs derived from computational problems exhibiting thresholding behavior. As discussed, symbol/bit MAP decoding of error correcting codes can be regarded as a statistical inference problem. In this section, we will explore BP decoding to determine the threshold for reliable communication and the corresponding optimization of LDPC code ensembles for communication over a binary-input, output-symmetric memoryless channel (such as the BSC, or more generally a BMS).
Algorithm Overview
Recall that the conditional distribution of the channel input $\underline{x}$ given the output $\underline{y}$ is $P(\underline{x} \mid \underline{y})$, and that we wish to find the value $x_i$ that maximizes the marginal of the probability below, given $\underline{y}$:
$P(\underline{x} \mid \underline{y}) \cong \prod_{i=1}^{N} Q(y_i \mid x_i)\ \prod_{a=1}^{M} \mathbb{I}\big(\textstyle\bigoplus_{i \in \partial a} x_i = 0\big)$ (1)
where $Q(y_i \mid x_i)$ is the conditional probability of observing noisy bit $y_i$ given that $x_i$ was sent. For the BSC we have $Q(y \mid x) = 1-p$ if $y = x$ and $Q(y \mid x) = p$ if $y \neq x$.
Furthermore, $\mathbb{I}\big(\bigoplus_{i \in \partial a} x_i = 0\big)$ is an indicator variable which takes value 1 if $\underline{x}$ satisfies parity check $a$ and 0 otherwise. In particular, the product of these indicators accounts for the fact that zero probability should be assigned to hypotheses which are not in the codebook (indicated by at least one parity check failing).
We would like to design a message passing scheme such that the incoming messages for a given variable node $i$ encode its marginal probability $P(x_i \mid \underline{y})$.
Note, first and foremost, that this probability can be factorized a la BP factor graphs such that there is a factor node for each parity check, contributing the indicator term above, and a factor node for each channel probability term $Q(y_i \mid x_i)$. Since each channel probability term is only connected to a single variable, its message never gets updated during BP, and so we omit it from the factor graphs (e.g. note that figure 2 only has parity check nodes and variable nodes).
The message passing scheme ends up taking the form
$\nu^{(t+1)}_{i\to a}(x_i) \cong Q(y_i \mid x_i) \prod_{b \in \partial i \setminus a} \hat\nu^{(t)}_{b\to i}(x_i)$ (2)
$\hat\nu^{(t)}_{a\to i}(x_i) \cong \sum_{\{x_j\}_{j \in \partial a \setminus i}} \mathbb{I}\big(x_i \oplus \textstyle\bigoplus_{j \in \partial a \setminus i} x_j = 0\big) \prod_{j \in \partial a \setminus i} \nu^{(t)}_{j\to a}(x_j)$ (3)
where $\partial a$ denotes the neighborhood of factor node $a$, $\partial a \setminus i$ denotes that neighborhood excluding $i$, and the sum in (3) is over all possible configurations of the neighbors of $a$ not including $i$.
Messages are passed along the edges as distributions over binary valued variables described by the log-likelihoods
$h_{i\to a} = \frac{1}{2}\log\frac{\nu_{i\to a}(0)}{\nu_{i\to a}(1)}$ (4)
$u_{a\to i} = \frac{1}{2}\log\frac{\hat\nu_{a\to i}(0)}{\hat\nu_{a\to i}(1)}$ (5)
We also introduce the a priori log-likelihood $B_i$ for bit $i$ given the received symbol $y_i$: $B_i = \frac{1}{2}\log\frac{Q(y_i \mid 0)}{Q(y_i \mid 1)}$.
Once we parametrize the messages as log-likelihoods, it turns out we can rewrite our update rules in terms of the parametrized values h and u, making updates much simpler:
$h^{(t+1)}_{i\to a} = B_i + \sum_{b \in \partial i \setminus a} u^{(t)}_{b\to i}$ (6)
$u^{(t)}_{a\to i} = \operatorname{atanh}\Big\{ \prod_{j \in \partial a \setminus i} \tanh\big(h^{(t)}_{j\to a}\big) \Big\}$ (7)
Given a set of messages, we would perform decoding via the overall log-likelihood $h_i = B_i + \sum_{b \in \partial i} u_{b\to i}$, where bit $x_i$ gets decoded to 0 for $h_i > 0$ and to 1 for $h_i < 0$.
Typically BP is run until it converges to a set of messages that decode to a word in the codebook, or until a maximum number of iterations has been reached. Other stopping criteria exist, such as requiring the messages at time steps t and t+1 to all be within some small tolerance of one another.
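To make the update rules concrete, here is a minimal sketch of BP decoding over a BSC using the log-likelihood parametrization of equations (6)-(7). The data layout (an edge list), the fixed iteration count, and the clipping inside atanh are our own simplifications; this is illustrative, not a tuned decoder.

```python
import math
from collections import defaultdict

def bp_decode_bsc(N, edges, y, p, max_iter=50):
    """Sketch of BP decoding with the log-likelihood updates (6)-(7) over a BSC.

    `edges` is a list of (variable, check) pairs describing the factor graph,
    `y` is the received 0/1 word, and p is the flip probability.
    """
    B = [0.5 * math.log((1 - p) / p) * (1 if yi == 0 else -1) for yi in y]
    var_edges, chk_edges = defaultdict(list), defaultdict(list)
    for e, (i, a) in enumerate(edges):
        var_edges[i].append(e)
        chk_edges[a].append(e)

    h = [B[i] for (i, a) in edges]           # variable-to-check messages, initialised to B_i
    u = [0.0] * len(edges)                   # check-to-variable messages
    for _ in range(max_iter):
        for a, es in chk_edges.items():      # eqn (7): u = atanh( product of tanh(h) over the other edges )
            for e in es:
                prod = 1.0
                for f in es:
                    if f != e:
                        prod *= math.tanh(h[f])
                u[e] = math.atanh(max(-0.999999, min(0.999999, prod)))
        for i, es in var_edges.items():      # eqn (6): h = B_i + sum of the other incoming u's
            total = B[i] + sum(u[f] for f in es)
            for e in es:
                h[e] = total - u[e]
    # Overall log-likelihood h_i = B_i + sum of incoming u's; decode 0 if positive, 1 otherwise.
    return [0 if B[i] + sum(u[e] for e in var_edges[i]) > 0 else 1 for i in range(N)]
```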
It is important to note some properties of BP:
BP always terminates in $t$ steps if the factor graph is a tree of depth $t$
It is not known under what circumstances so called “loopy” BP will converge for non-tree graphs
Because factor graphs of LDPC codes are relatively sparse, they appear "locally tree-like", a property which is believed to play a crucial role in BP convergence over the factorized probability distribution used in LDPC MAP decoding (eqn 1). As mentioned above, BP manages to converge on many sorts of non-tree-like graphs, provided they have "nice" probability distributions. For example, BP on the SK model is known to converge even though the underlying factor graph is a complete graph!
It turns out that BP converges below some noise level for LDPC decoding, and that the threshold at which it fails to converge, $\epsilon_d$, represents a phase transition to a generically different regime in the solution space of the codebook. It's been noted elsewhere that the BP threshold is often the threshold of fast solving for many cool problems, e.g. k-SAT. This is because BP is often thought to generically represent the "best" possible local (ergo "fast") algorithm for those problems.
In appendix C we will show some important properties of BP. The following tables summarize important results for several ensembles and channels. Note how close the information-theoretic threshold for LDPCs is to the actual Shannon limit for the channels below.
Table 1: Thresholds for BSC Various thresholds for BP over LDPC codes in a Binary Symmetric Channel
d  k  p_d     p_c     Shannon limit
3  4  0.1669  0.2101  0.2145
3  5  0.1138  0.1384  0.1461
3  6  0.084   0.101   0.11
4  6  0.1169  0.1726  0.174
See Mezard and Montanari, 2009 Chapt 15. for this table
Table 2: Thresholds for BEC Various thresholds for BP over LDPC codes in a Binary Erasure Channel
d  k  ε_d     ε_c     Shannon limit
3  4  0.65    0.746   0.75
3  5  0.52    0.59    0.6
3  6  0.429   0.4882  0.5
4  6  0.506   0.6656  0.6667
See Mezard and Montanari, 2009 Chapt 15. for this table
We will now show exact behavior of the (3,6) LDPC ensemble over the binary erasure channel.
Algorithmic Thresholds for Belief Propagation (BP)
Definitions and notation
Definition 1. In a Binary Erasure Channel (BEC), when the transmitter sends a bit $x \in \{0,1\}$, the receiver receives the correct bit with probability $1-\epsilon$ or an erasure symbol $*$ with probability $\epsilon$.
For BECs, the Shannon capacity (the maximum number of data bits that can be transmitted per encoded bit) is given by $C = 1 - \epsilon$.
Definition 2. An $N$-bit Error Correcting Code (ECC) is defined by a codebook $\mathcal{C} \subseteq \{0,1\}^N$. The transmitter encodes information as an element $\underline{x} \in \mathcal{C}$. The receiver receives a corrupted version $\underline{y}$ of the transmitted codeword. To decode, it picks the $\underline{x} \in \mathcal{C}$ that is most likely given $\underline{y}$ and the channel characteristics.
For ease of discourse, we have refrained from defining ECC in full generality.
Definition 3. An $N$-bit Low Density Parity Check (LDPC) code is an ECC with $\mathcal{C} = \{\underline{x} : \mathbb{H}\underline{x} = \underline{0}\}$. Here $\mathbb{H}$ is an $M \times N$ matrix and arithmetic is over $GF(2)$. $\mathbb{H}$ is a sparse parity-check matrix. $\Lambda$ and $P$ are finite polynomials that characterize $\mathbb{H}$: $\Lambda_d$ is the fraction of columns with $d$ 1s and $P_k$ is the fraction of rows with $k$ 1s. Since these are fractions/probabilities, they must be normalized; hence $\Lambda(1) = P(1) = 1$. $\mathbb{H}$ is a random matrix, and therefore has full rank with high probability.
In an LDPC code, $|\mathcal{C}| = 2^{N-M}$. Hence every $N$ bits contain $N-M$ bits of information, making the rate $(N-M)/N$. Over binary erasure channels (BECs), on receiving $\underline{y}$, the decoder must choose an $\underline{x}$ such that $\mathbb{H}\underline{x} = \underline{0}$ and $x_i = y_i$ whenever $y_i \neq *$ ($x_i$, $y_i$ denote the $i$-th bit of $\underline{x}$ and $\underline{y}$ respectively). That is, the bits that were successfully transmitted should be preserved; other bits should be chosen to satisfy the parity check equations. If multiple correct choices of $\underline{x}$ are possible, then $\underline{y}$ cannot be unambiguously decoded.
BP/Peeling algorithm
In general, decoding can be computationally hard. But there exists an error rate $\epsilon_d$, a function of $\Lambda$ and $P$, below which belief propagation succeeds in decoding. Let $\epsilon_c$ be the maximum error rate up to which successful decoding is possible (i.e. we can unambiguously determine the transmitted codeword) and $\epsilon_{\rm Shannon}$ be the Shannon limit; then $\epsilon_d \le \epsilon_c \le \epsilon_{\rm Shannon}$. In general, these inequalities can be strict, illustrating the gap between what is information-theoretically possible and what is computationally feasible.
Belief propagation (BP) for decoding LDPC codes is equivalent to a simple peeling algorithm. Let us first describe the factor-graph representation for decoding, shown in figure 3. Variables on the left are the received symbols $y_i$. Factor nodes on the right denote the parity-check constraints (rows of $\mathbb{H}$). The XOR of the variables connected to each factor node must be 0.
BP/the peeling algorithm works as follows. For simplicity of exposition, consider that the all-zeros codeword has been transmitted. Since this is a linear code, there is no loss of generality. At first, only the variables that were successfully transmitted are fully determined. In the next round, the factor nodes that have exactly one undetermined variable can determine that variable using their parity-check constraint. This repeats until every variable is determined or no factor node has exactly one undetermined neighbor; a sketch follows.
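Here is a minimal sketch of that peeling procedure (the representation of the factor graph as an edge list and the names are ours); erased bits are marked None:

```python
from collections import defaultdict

def peel_decode(N, edges, y):
    """Sketch of the peeling (BP) decoder over the BEC.

    `edges` is a list of (variable, check) pairs and `y` is the received word with
    erased positions set to None. Returns the word with every bit the algorithm
    could determine; bits still equal to None are where BP got stuck.
    """
    check_vars = defaultdict(list)
    for i, a in edges:
        check_vars[a].append(i)
    x = list(y)
    progress = True
    while progress:
        progress = False
        for nbrs in check_vars.values():
            unknown = [i for i in nbrs if x[i] is None]
            if len(unknown) == 1:            # this check pins down its single unknown bit
                x[unknown[0]] = sum(x[j] for j in nbrs if x[j] is not None) % 2
                progress = True
    return x

# e.g. transmit the all-zeros codeword of the graph built earlier through a BEC:
#   y = [None if random.random() < 0.3 else 0 for _ in range(N)]
#   x_hat = peel_decode(N, edges, y)
```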
Fig 3 An example of a received codeword and corresponding parity-check constraints that is information-theoretically determined (only the all-zeros codeword satisfies all constraints), but cannot be decoded by the belief propagation algorithm because no factor node has exactly one unknown variable. Here, only V0 is correctly received. Constraints F1, F2, F3 and F4 imply that V1=V2=V3=V4=V5. If V0=0, then all the rest must be 0s to satisfy F0 (note, the only other valid codeword is all ones).
BP isn’t perfect
This algorithm is not perfect. Figure 3 is an example of a received codeword which can be unambiguously decoded (only the all-zeros codeword satisfies all the constraints), but on which BP fails, because at every point all factor nodes have more than one unknown variable. It seems that the only way to solve problems like this is to exhaustively work out the implications of the parity-check equations. If this example seems contrived, that is because it is. Decoding becomes harder as the degree and number of constraints increase; we had to add a lot of constraints to make this example work. Fortunately, if the graph is sparse and the error rate is low enough, BP succeeds. We prove this in the following theorem:
Phase transitions for BP
Theorem 1. A $(d,k)$-regular LDPC code can be decoded by BP as $N \to \infty$ when the error rate is less than $\epsilon_d = \inf_{z \in (0,1]} \frac{z}{[1-(1-z)^{k-1}]^{d-1}}$.
Proof. To prove this, let us analyze the density evolution. For BECs, this is particularly simple, as we only need to keep track of $z_t$ and $\hat z_t$, the fractions of undetermined variable and factor nodes at timestep $t$ respectively. As $N \to \infty$, these fractions are probabilities. A factor node is determined when all of its variables are determined (note: if all but one are determined, the last one can be immediately determined). The following recursion relations hold:
$z_{t+1} = \epsilon\,\hat z_t^{\,d-1}, \qquad \hat z_t = 1 - (1 - z_t)^{k-1}$ (8)
The first holds because a variable node is undetermined at timestep $t+1$ if it was originally undetermined (which happens with probability $\epsilon$) and if it wasn't determined in the last step by any of its $d-1$ other factor nodes, which happens with probability $\hat z_t^{\,d-1}$.
A similar reasoning holds for the second relation. $1 - z_t$ is the probability that a given neighboring variable node is determined. $(1-z_t)^{k-1}$ is the probability that all of the other $k-1$ neighbors are determined, i.e. that at most one is undetermined, and hence this function node is determined. $1 - (1-z_t)^{k-1}$ is the probability that this function node is undetermined.
Composing the two relations in equation 8, we get the recursion: $z_{t+1} = F_\epsilon(z_t) \equiv \epsilon \big[1 - (1-z_t)^{k-1}\big]^{d-1}$
An example of $F_\epsilon(z)$ is shown in Figure 4 for a (3,6) regular graph, where variable nodes and function nodes have 3 and 6 neighbors respectively. On the left, $F_\epsilon(z)$ is always below the identity $z$. Hence the recursion, starting from $z_0 = \epsilon$ at the far right, will converge to the fixed point $z = 0$. But on the right, $\epsilon$ is large enough that $F_\epsilon(z)$ intersects $z$ at a non-zero point. Hence the recursion will converge to that higher fixed point instead, without ever reaching the `correct' fixed point at $z = 0$. BP therefore gets stuck at a suboptimal solution, though information-theoretically a correct solution exists. This can be interpreted as a glassy state, where many deep local minima are present, and BP will converge to the wrong minimum.
The condition for BP to converge to $z = 0$ is $F_\epsilon(z) < z$ for all $z \in (0, \epsilon]$. Hence the threshold error rate $\epsilon_d$ below which this condition holds is: $\epsilon_d = \inf_{z \in (0,1]} \frac{z}{\big[1-(1-z)^{k-1}\big]^{d-1}}$
For (3,6) regular graphs, $\epsilon_d \approx 0.4294$. ∎
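The threshold can also be evaluated numerically from the expression above; a minimal sketch (the crude grid search is our own choice of method):

```python
def bp_threshold(d, k, grid=100_000):
    """eps_d = inf over z in (0, 1] of  z / (1 - (1 - z)**(k - 1))**(d - 1)."""
    return min(z / (1.0 - (1.0 - z) ** (k - 1)) ** (d - 1)
               for z in (j / grid for j in range(1, grid + 1)))

print(round(bp_threshold(3, 6), 4))   # ~0.4294, matching the BEC table above
```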
Fig 4 The recursion relation for a (3,6) regular graph, where $F_\epsilon(z)$ is plotted against $z$. The identity function is also shown (in blue). The graphs above and below show the cases where the error rate is below and above $\epsilon_d$ respectively.
Another interesting phase transition can be observed. As $\epsilon$ increases, for some ensembles the first intersection of $F_\epsilon(z)$ and $z$ appears at a non-zero point. For others, it starts at $z = 0$ and moves up continuously. In the former case, the decoding error rate jumps discontinuously from 0 to a non-zero value as $\epsilon$ increases. In the latter, it increases continuously.
To see the gap between $\epsilon_d$ and what can be decoded information-theoretically, we look at what happens when the degrees of the LDPC code are increased while keeping the rate constant. Specifically, consider the $(d,k)$ regular graph as $d, k \to \infty$ while the ratio $d/k$ is fixed. Note that the rate of the code is $1 - d/k$. This is shown in Figure 5: $\epsilon_d$ decreases toward 0. But as the degrees increase, it should not become information-theoretically harder to decode. In fact, as $k \to \infty$, the code approaches a random linear code, which is known to achieve Shannon capacity. Hence we can believe that the information-theoretically achievable decoding threshold is non-decreasing. Thus there is a gap between what is information-theoretically possible to decode, and what is computationally feasible using Belief Propagation.
Fig 5 $\epsilon_d$ decreases as the degrees increase while the rate is fixed; in the large-degree limit it tends to 0.
Finally, we would like to mention that it is possible to choose a sequence of degree polynomials $(\Lambda, P)$ such that $\epsilon_d$ approaches the Shannon limit. While it is non-trivial to sample exactly from this distribution, good approximations exist, and LDPC codes can achieve close to channel capacity over binary erasure channels.
The solution space between $\epsilon_d$ and $\epsilon_c$
The energy landscape of LDPC decoding
We have already shown the exact location of the threshold $\epsilon_c$ above which decoding is not possible for the LDPC ensemble, and have also investigated the point at which the BP algorithm fails, $\epsilon_d$.
It should not be surprising to us that any given algorithm we attempt to throw at the problem fails at a certain point below $\epsilon_c$. In fact there are many simple, random algorithms from the class of Markov chain Monte Carlo methods which give fast run times but which fail at values far below even $\epsilon_d$. The failing point of a particular algorithm, per se, is not necessarily very significant. We shouldn't expect that any given algorithm, besides explicitly calculating the symbol MAP by traversing the entire codebook, would be able to achieve $\epsilon_c$.
What is of interest to us here is that $\epsilon_d$ marks a provable threshold in the solution space of LDPC decoding, above which it is likely that no local methods, and therefore no fast algorithms, can decode with non-vanishing probability. We will show later precisely the number and energy levels of these metastable states for the BEC. Proof of this transition for other channel types is outside the scope of this lecture.
In this section we will rephrase decoding as an energy minimization problem and use three techniques to explore the existence of metastable states and their effect on local search algorithms.
In particular, we will first use a generic local search algorithm that attempts to approximately solve the energy minimization formulation of decoding.
We will next use a more sophisticated Markov chain Monte Carlo method called simulated annealing. Simulated annealing is useful because it offers a perspective that more closely models real physical processes and that has the property that its convergence behavior closely mimics the structure of the metastable configurations.
Energy minimization problem
To begin, we will reframe our problem in terms of constraint satisfaction.
The codewords of an LDPC code are solutions of a CSP. The variables are the bits of the word and the constraints are the parity check equations. Though this means our constraints are a system of linear equations, our problem here is made more complicated by the fact that we are searching for not just ANY solution to the system but for a particular solution, namely the transmitted codeword.
The received message tells us where we should look for the solution.
Assume we are using a binary-input, memoryless, output-symmetric channel with transition probability $Q(y \mid x)$.
The probability that $\underline{x}$ was the transmitted codeword, given $\underline{y}$, is $\mu_y(\underline{x})$, where
$\mu_y(\underline{x}) \cong \prod_{i=1}^{N} Q(y_i \mid x_i) \prod_{a=1}^{M} \mathbb{I}\big(\textstyle\bigoplus_{i \in \partial a} x_i = 0\big)$ (10)
We can associate an optimization problem with this code. In particular, define $E(\underline{x})$ to be twice the number of parity check equations violated by $\underline{x}$.
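As a concrete anchor for the optimization view, here is a minimal sketch of this energy function, using our own list-of-checks representation of the code:

```python
def energy(checks, x):
    """E(x) = twice the number of parity checks violated by the word x.

    `checks` is a list of lists of variable indices; `x` is a 0/1 sequence.
    Codewords are exactly the words with E(x) == 0.
    """
    return 2 * sum(sum(x[i] for i in check) % 2 for check in checks)

# e.g. energy([[0, 1, 2], [2, 3, 4]], [1, 0, 0, 0, 0]) == 2, since only the first check is violated
```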
We have already discussed how symbol MAP decoding computes the marginals of the distribution $\mu_y(\underline{x})$ and how word MAP decoding finds its argmax.
We shall here discuss two related problems
optimizing the energy function within a subset of the configuration space defined by the received word
sampling from a ’tilted’ Boltzmann distribution associated with the energy
Define the log-likelihood of x being the input given the received y to be
$L_y(\underline{x}) = \sum_{i=1}^{N} \log Q(y_i \mid x_i)$ (11)
If we assume WLOG that the all-zero codeword was transmitted, then by the law of large numbers, for large N the log-likelihood of this codeword is close to $-Nh$, where $h$ is the channel entropy. The probability of an order-N deviation away from this value is exponentially small.
This suggests that we should look for the transmitted codeword amongst those $\underline{x}$ such that $L_y(\underline{x})$ is close to $-Nh$.
The constrained version of our decoding strategy (known as typical-pairs decoding) is thus: find $\underline{x}$ such that $L_y(\underline{x})$ is close to $-Nh$. This constraint will be referred to as the `distance constraint', and if exactly one codeword satisfies the distance constraint, we return it.
Since codewords are global energy minima ($E(\underline{x}) = 0$ for all $\underline{x} \in \mathcal{C}$), we can phrase typical-pairs decoding as an optimization problem:
Minimize $E(\underline{x})$ subject to the distance constraint on $L_y(\underline{x})$.
This decoding succeeds iff the minimum is non-degenerate. This happens with high probability for $\epsilon < \epsilon_c$ and with zero probability for $\epsilon > \epsilon_c$. In particular, there are exponentially many degenerate energy minima above $\epsilon_c$.
Similar to what we have seen elsewhere in the course, there exists an intermediate regime, $\epsilon_d < \epsilon < \epsilon_c$, in which the global energy minimum is still the correct codeword but there is an exponentially large number of local energy minima obscuring it (see figure 6).
What is so special about BP is that the threshold at which these exponentially many metastable states proliferate is exactly the algorithmic threshold $\epsilon_d$ for BP which we derived earlier.
Fig 6 A cartoon landscape for the energy function defined above (twice the number of violated checks). Left: the energy has a unique global minimum with $E = 0$ (the transmitted codeword) and no local minima. Center: many deep local minima appear, although the global minimum is non-degenerate. Right: more than one codeword is compatible with the likelihood constraint, and the global minimum is degenerate. Adapted from Mezard and Montanari, 2009, Chapt 21.
While finding solutions to $\mathbb{H}\underline{x} = \underline{0}$ amounts to Gaussian elimination, the distance constraint is not a linear constraint. Thus one needs some more advanced search procedure to find satisfying vectors $\underline{x}$.
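For intuition on the linear-algebra part: over the BEC, finding some word consistent with the parity checks and the correctly received bits is plain Gaussian elimination over GF(2). The sketch below (our own representation; it ignores the non-linear distance constraint discussed above) returns one consistent completion of the erased bits:

```python
import numpy as np

def gf2_complete(H, y):
    """Find one x with H x = 0 (mod 2) and x_i = y_i on the non-erased positions.

    `y` is a list with erased entries set to None. Gaussian elimination is done
    over GF(2) on the erased columns; returns None if no consistent completion exists.
    """
    erased = [i for i, yi in enumerate(y) if yi is None]
    known = [i for i, yi in enumerate(y) if yi is not None]
    # Move the known bits to the right-hand side:  H_E x_E = H_K y_K  (mod 2).
    A = (H[:, erased] % 2).astype(np.uint8)
    b = (H[:, known] @ np.array([y[i] for i in known], dtype=np.uint8)) % 2
    M, E = A.shape
    pivot_of, r = {}, 0
    for c in range(E):                       # Gauss-Jordan elimination over GF(2)
        rows = [i for i in range(r, M) if A[i, c]]
        if not rows:
            continue
        A[[r, rows[0]]], b[[r, rows[0]]] = A[[rows[0], r]], b[[rows[0], r]]
        for i in range(M):
            if i != r and A[i, c]:
                A[i] ^= A[r]
                b[i] ^= b[r]
        pivot_of[c] = r
        r += 1
    if any(b[i] and not A[i].any() for i in range(M)):
        return None                          # inconsistent: some check cannot be satisfied
    x = list(y)
    for c, i in enumerate(erased):           # free (non-pivot) erased bits default to 0
        x[i] = int(b[pivot_of[c]]) if c in pivot_of else 0
    return x
```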
We will show that if one resorts to local-search-based algorithms, the metastable states above $\epsilon_d$ block the algorithm. Furthermore, we suggest that the behavior of the local algorithms discussed below is typical of all local search algorithms (including BP), and that it is very likely that no fast algorithm exists capable of finding global energy minima without getting caught in the metastable states which proliferate above $\epsilon_d$.
Below is the simplest of local search algorithms, $\Delta$-local search.
Fig 7 Excerpted from Mezard and Montanari, 2009 Chapt 21
$\Delta$-local search typifies local search algorithms. It walks semi-randomly through the landscape searching for low-energy configurations. Its parameter $\Delta$ is defined such that, when stuck in a metastable state, it can climb out of it in polynomial time if the height of the energy barrier is at most $\Delta$. Thus its failure in the region $\epsilon > \epsilon_d$ suggests that there are no barriers of constant size and that barriers of order N are the norm.
MCMC and the relaxation time of a random walk
We can understand the geometry of the metastable states in greater detail by reframing our MAP problem as follows:
$\mu_{y,\beta}(\underline{x}) \cong \exp\{-\beta E(\underline{x})\} \prod_{i=1}^{N} Q(y_i \mid x_i)$ (12)
This form is referred to as the `tilted' Boltzmann distribution because it is a Boltzmann distribution biased by the likelihood function.
In the low temperature limit this reduces to eqn 10 because it finds support only over words in the codebook.
This distribution more closely mimics physical systems. For nonzero temperature it allows support over vectors which are not actually in our codebook but still have low distance to our received message and have low energy – this allows us to probe the metastable states which trap our local algorithms. This is referred to as a code with `soft’ parity check constraints as our distribution permits decodings which fail some checks.
We will use the following algorithm excerpted from Mezard and Montanari Chapt 21:
Fig 8 Excerpted from Mezard and Montanari, 2009 Chapt 21
where a Glauber update consists of scanning through the bits of the current configuration and flipping the value of bit $i$ with probability
$w_i(\underline{x}) = \frac{\mu_{y,\beta}(\underline{x}^{(i)})}{\mu_{y,\beta}(\underline{x}) + \mu_{y,\beta}(\underline{x}^{(i)})}$ (13)
where $\underline{x}$ is the current configuration and $\underline{x}^{(i)}$ is the configuration obtained by flipping its $i$-th bit.
This method is a spin-off of traditional Markov chain Monte Carlo algorithms, with the variation that we lower the temperature according to an annealing schedule. At high temperature the distribution assigns probability to states essentially in proportion to the likelihood component of equation 12, allowing the chain to randomly sample the configuration space in the neighborhood of the received noisy word; in the low temperature limit it becomes concentrated on configurations which are proper codewords.
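Here is a minimal sketch of such an annealing run for a BSC; the channel model, schedule format, and data layout are our own simplifications of the procedure in the text, and the acceptance probability implements the Glauber rule (13) for the tilted distribution (12):

```python
import math
import random

def anneal_decode(N, checks, y, p, betas, sweeps_per_beta=10, seed=0):
    """Sketch of simulated annealing on the 'tilted' Boltzmann distribution (12).

    `checks` is a list of lists of variable indices, `y` is the word received over a
    BSC with flip probability p, and `betas` is an increasing schedule of inverse
    temperatures. Returns the final configuration and its energy E = 2 * (# violated checks).
    """
    rng = random.Random(seed)
    x = list(y)                                        # start from the received word
    var_checks = [[] for _ in range(N)]
    for a, check in enumerate(checks):
        for i in check:
            var_checks[i].append(a)
    unsat = [sum(x[i] for i in check) % 2 for check in checks]
    llr = math.log((1 - p) / p)

    for beta in betas:
        for _ in range(sweeps_per_beta):
            for i in range(N):                         # one Glauber sweep, eqn (13)
                dE = 2 * sum(-1 if unsat[a] else 1 for a in var_checks[i])
                dL = llr if x[i] == y[i] else -llr     # drop in log-likelihood if bit i is flipped
                # Accept the flip with probability mu(x_flip) / (mu(x) + mu(x_flip)).
                if rng.random() < 1.0 / (1.0 + math.exp(beta * dE + dL)):
                    x[i] ^= 1
                    for a in var_checks[i]:
                        unsat[a] ^= 1
    return x, 2 * sum(unsat)

# e.g. betas = [0.2 * t for t in range(1, 26)] mimics a slow cooling schedule.
```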
This method is useful to us because MCMC is a good model of how randomized local search for optimal configurations occurs in physical systems. Furthermore, whether an MCMC converges, and the time it takes to do so, tells us about both the energy wells it terminates in and the barriers between minima in the energy landscape.
Let’s now show a property relating convergence times of MCMCs and energy barriers known as the Arrhenius law.
Take the example of a simple MCMC random walk over the following one-dimensional landscape:
Fig 9 A random walk along a line in which there are two ground states separated by an energy barrier of height $\Delta E$. Excerpted from Mezard and Montanari, 2009, Chapt 13.
We find that the expected number of time steps to cross from one well to the other is governed by the Arrhenius law $\tau \approx \exp(\Delta E / T)$.
In general, if the largest energy barrier between any two components of the configuration space (also known as the bottleneck) has height $\Delta E$, the time it takes to sample both components, also known as the relaxation time of the MCMC, is $\tau \approx \exp(\Delta E / T)$.
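To see the Arrhenius scaling concretely, here is a small sketch with a toy double-well landscape of our own choosing (not the one in the figure): it measures the time for a Metropolis walk to cross a barrier of height ΔE and compares it with exp(ΔE/T).

```python
import math
import random

def crossing_time(barrier, T, L=10, seed=0):
    """Steps for a single Metropolis walk to cross a triangular energy barrier of height `barrier`.

    The walk lives on {0, ..., 2L} with V(x) = barrier * (1 - |x - L| / L), starts in
    the left well (x = 0), and we count steps until it first reaches the right well (x = 2L).
    """
    rng = random.Random(seed)
    V = lambda x: barrier * (1.0 - abs(x - L) / L)
    x, t = 0, 0
    while x < 2 * L:
        t += 1
        nxt = min(2 * L, max(0, x + rng.choice((-1, 1))))
        if rng.random() < math.exp(-(V(nxt) - V(x)) / T):   # Metropolis acceptance rule
            x = nxt
    return t

T = 1.0
for barrier in (2.0, 4.0, 6.0):
    # Crossing times grow roughly like exp(barrier / T), in line with the Arrhenius law.
    print(barrier, crossing_time(barrier, T), round(math.exp(barrier / T)))
```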
With this in mind, we can apply our simulated annealing MCMC to LDPC decoding and examine the properties of the bottlenecks, or metastable states, in our configuration landscape.
Exact values of the metastable energy states for the BEC
It is a well-known consequence of the 1RSB cavity method that the number of metastable states at energy density $e$ grows like $\exp\{N\Sigma(e)\}$, where $\Sigma(e)$ is known as the energetic complexity function, a function whose form is implied by the 1RSB cavity equations. This computation can be carried out using a method called Survey Propagation, which constructs a factor graph of the messages passed in the original BP factor graph and estimates the values of the marginals of those messages via another round of BP (hence the name 1-step RSB).
Omitting the actual form of the calculations, we will show the following approximate results for the BEC.
Fig 10 Metastable states for random elements of the (3,6) regular ensemble used over the BEC. Left: the complexity function $\Sigma(e)$ as a function of the energy density for erasure rates above $\epsilon_d$. Right: the maximum and minimum energy densities of metastable states as a function of the erasure probability. Note that for $\epsilon_d < \epsilon < \epsilon_c$ the curve is positive only for non-zero metastable energy densities; this indicates exponentially many metastable states. At erasure rates above $\epsilon_c$ there are exponentially many degenerate ground states. Excerpted from Mezard and Montanari, 2009, Chapt 21.
In the regime $\epsilon_d < \epsilon < \epsilon_c$ there exists a zero-energy word corresponding to the correct solution. On top of this, there exist non-trivial solutions to the 1RSB equations yielding a complexity curve that is positive in this regime. The positive complexity means that there are exponentially many such states, and their finite energy means they violate a finite fraction of the parity checks, making their energy wells relatively deep.
As the error rate of the channel increases above $\epsilon_c$, the minimum energy of the metastable states reaches zero continuously. This means that at noise levels above $\epsilon_c$ there is an exponential number of zero-energy states corresponding to configurations which aren't codewords. These states are separated by energy barriers of order N, thus making the relaxation time of local algorithms exponential in N, by the Arrhenius law, in this regime.
Fig 11 Decoding random codes from the (3,6) ensemble over the BEC with simulated annealing; in both cases the annealing schedule consists of equidistant temperatures. Left: convergence to the correct zero-energy ground state at an erasure rate below $\epsilon_d$. Right: convergence to approximately the energy density of the highest metastable states (as calculated by the complexity function via 1RSB) at an erasure rate above $\epsilon_d$. Excerpted from Mezard and Montanari, 2009, Chapt 21.
Here you can see a rough sketch of the convergence of the simulated annealing algorithm. As the temperature decreases, in the regime $\epsilon < \epsilon_d$ the algorithm converges to a zero-energy ground state. In the figure on the right we can see that simulated annealing converges to the horizontal line, which corresponds to the energy of the highest metastable state for the BEC at that erasure rate.
Thus we see that our local search algorithms end up being attracted to the highest-energy metastable states.
Fig 12 Decoding random codes as in figure 11. Here we plot the minimum energy density achieved through simulated annealing as a function of the erasure probability of the BEC. The continuous line indicates the highest-lying metastable states as calculated from the complexity function via 1RSB. Excerpted from Mezard and Montanari, 2009, Chapt 21.
Though there is not necessarily an exact correspondence between the residual energy at T=0 for simulated annealing and the energy of the highest metastable state, we see that the two agree closely across all values of the erasure probability, suggesting that local search tends to get caught in the deepest metastable energy wells.
This discussion shows that the algorithmic threshold of BP, $\epsilon_d$, indicates the onset of a truly different regime within the energy landscape of the codebook. Metastable states of high energy proliferate and become exponentially difficult to escape from via local search methods. Thus the failure of BP likely indicates a regime in which no fast algorithm can perform decoding, even though decoding is still theoretically possible below $\epsilon_c$, e.g. via exhaustive search of the codebook.
Appendix A: Random Code Ensembles
In an RCE, encoding maps applied to the information sequence are chosen with uniform probability over a solution space. Two decoding schemes are often used and applied to the noisy output: word MAP and symbol MAP decoding. MAP, otherwise known as "maximum a posteriori probability" decoding, works by maximizing the posterior distribution to output the most probable transmission. Word MAP decoding outputs the codeword with the highest probability, minimizing the block error probability, which is the probability with respect to the channel distribution that the decoded word is different from the true transmitted word. Symbol MAP decoding, on the other hand, minimizes the fraction of incorrect bits averaged over the transmitted codeword (the bit error rate).
An RCE code is defined by its codebook in Hamming space, the set of all binary strings of length N. In a Hamming space characterized by uniform probability, the number of codewords at a given Hamming distance is a function of the distance enumerator. Distance enumerators take as parameters different weights, given that probabilities of codewords are independent of each other. The distance enumerator shows that, for small enough fractional distance from the true message, the growth rate is negative and the average number of codewords at small distance from the true message vanishes exponentially with N. The Gilbert-Varshamov distance, a lower-bound threshold, shows that the average number of codewords is exponentially large at points where the weight enumerator is concentrated.
We look at the performance of the RCE code in communication over the Binary Symmetric Channel (BSC), where it is assumed that there is a probability p that transmitted bits will be "flipped" (i.e. with probability p, 1 becomes 0 and 0 becomes 1). With BSCs, channel input and output are the same length-N sequences of bits. At larger noise levels, there is an exponential number of codewords closer to the received word than the transmitted one, and decoding is unsuccessful. However, under the symbol MAP decoding scheme the i-th bit is decoded by maximizing its marginal probability, which amasses contributions from all codewords in the set. Above a threshold, the bit error rate is the same as if the message were transmitted without encoding and decoding, but below it, the RCE works quite well in transmission.
Finite-temperature decoding has also been looked at as an interpolation between the two MAP decoding schemes. At low noise, a completely ordered phase can be observed, as compared to a glassy phase at higher noise levels. Similar to a statistical physics model, we can also note an entropy-dominated paramagnetic phase at higher temperatures.
Each decoding scheme can be analogized to "sphere packing", where each codeword in Hamming space is surrounded by a sphere of radius r. Decoding schemes partition the Hamming space, so these spheres must be disjoint; if not, intersecting spheres must be eliminated. The lower bound on the number of remaining spheres is then given by the Gilbert-Varshamov bound, whereas the upper bound is dictated by the Hamming distance.
Another random code besides the RCE is the RLC, or random linear code. Encoding in an RLC is a linear map, under which all points are equiprobable. The code is specified by a binary generating matrix, and it can be shown to be error-free below the Shannon capacity.
There are several sources of randomness in codes. The code is chosen randomly from an ensemble, and the codeword to be transmitted is chosen with uniform probability from the code, according to the theorem of source-channel separation. The channel output is then distributed according to a probabilistic process accounting for channel noise, and decoding is done by constructing another probability distribution over possible channel inputs and by estimating its single-bit marginals. The decision on the i-th bit depends on this distribution. Thus, complications may arise in distinguishing between the two levels of randomness: code, channel input, and noise ("quenched" disorder) versus the Boltzmann probability distribution.
Appendix B: Weight enumerators and code performance
The geometric properties of LDPC codebooks are studied via the distance enumerator, which gives the number of codewords at Hamming distance d from a reference codeword. Taking the all-zeros codeword as the reference, this becomes the weight enumerator $\mathcal{N}(w)$ (the number of codewords having weight equal to w). To estimate the expected weight enumerator for a random code in the ensemble, we note that it grows exponentially in the block length N, and that each codeword has a weight that grows linearly with N. The exponential growth rate $\phi(\omega)$, with $\omega = w/N$ the reduced weight, is defined by
$\phi(\omega) \equiv \lim_{N\to\infty} \frac{1}{N} \log \mathbb{E}\,\mathcal{N}(N\omega)$ (14)
where $\mathbb{E}$ denotes an 'annealed average', i.e. one over a disordered system that could be dominated by rare instances in the ensemble. This gives an upper bound via the number of 'colored factor graphs' that have an even number of weighted incident edges, divided by the total number of factor graphs in the ensemble.
On the other hand, for graphs of fixed degrees with N variable nodes of degree l and M function nodes of degree k, the total number of edges F is given by F = Nl = Mk. A valid colored graph would have edges, with the number of variable nodes given in ways, l assignments of weighted sockets to nodes, and l assignments of unweighted sockets to nodes outside the set. If we take to denote the number of function nodes with weighted sockets under the constraints of and , we find the number of ways to color the function node sockets by
(15)
(16)
If we aim to join variable and check nodes so that colorings are matched, knowing that there are possible matchings in each ensemble element, this yields the following formula:
(17)
At low noise limits, code performance depends on the existence of codewords at distances close to the transmitted codeword. Starting with degree 1 and knowing that the parametric representation for weights is given by
(18)
derive that
(19)
when we scale to large N. This shows that the exponential growth rate $\phi(\omega)$ is strictly positive when $\omega$ is sufficiently small, and that the expected number of codewords within a small Hamming distance from a given codeword is exponential in N. If we take the logarithm of the expected weight enumerator and plot this versus the reduced weight $\omega$ for an irregular code with minimum variable degree 1, we see that $\phi$ is positive near the origin, but that its derivative diverges as $\omega \to 0$. Since this means that each codeword is surrounded by a large number of very close other codewords, this makes the code a very bad ECC, and makes it hard to discriminate between codewords at small Hamming distances with noisy observations. Applying this same logic to codes with minimum variable degree 2, we still observe that $\phi$ tends to 0, though more quickly than in the present case. If we assume that this holds beyond the asymptotic regime, we get
(20)
or that the number of codewords around a particular codeword is until a Hamming distance , otherwise known as the “effective minimum distance”. For , we find:
(21)
suggesting that LDPC codes with this property have good short distance behavior. Thus, any error that changes a fraction of the bits smaller than can be corrected in the absence of codewords within an extensive distance .
Let us now focus on the capacity of LDPC codes to correct typical errors in a probabilistic channel. Consider binary symmetric channels that flip each transmitted bit independently with probability p. If the all-zero codeword has been transmitted and $\underline{y}$ is received, whose components are iid random variables that take value 0 with probability 1-p and value 1 with probability p, then we use the MAP decoding strategy to minimize the block error rate and output the codeword closest to the channel output $\underline{y}$. The expected block error probability over the code ensemble is an indicator of code ensemble performance. We will show that, as N grows, these codes undergo a phase transition separating a low-noise from a high-noise phase. To derive a lower bound for the capacity of LDPC codes in a BSC channel, we take the size of the codebook and, by the union bound:
(22)
(23)
(24)
This derivation proves that the block error probability depends on the weight enumerator and on a second, exponential term. This second term shows that an increase in the weight of a codeword corresponds to its contribution being scaled down by an exponential factor. This is because it is less likely that the received message will be closer to a codeword of large weight than to the all-zero codeword. A geometric construction of this phenomenon implies that, for large enough weights, Shannon's Theorem implies that the exponent is bounded away from 0 for any non-vanishing rate, so that at any p less than the ML threshold one can communicate with an arbitrarily small error probability. At a probability equal to the lower bound, the upper bound is dominated by codewords of a particular weight, suggesting that each time an error occurs, a finite fraction of the bits are decoded incorrectly and that this fraction doesn't change very much per transmission. The construction also illustrates that this fraction of incorrectly decoded bits jumps discontinuously from 0 to a finite value when p crosses the critical value, constituting a "gap." This gap is close to a factor of 2.
Appendix C: BP performance
See figure 2 for an illustration of a factor graph illustrating this relationship. Again, recall that for LDPC code ensembles in the large block-length limit, the degree distributions of variable nodes and check nodes are given by $\Lambda(x)$ and $P(x)$ respectively, and we assume for simplicity that messages are initialized to 0. This, together with channel symmetry, implies that the bit error probability is independent of the transmitted codeword and that, therefore, we have the freedom to assume transmission of the all-zero codeword. In analyzing the recursion at the basis of the BP algorithm, we can show that decoding performance improves over time on the basis of symmetry and physical degradation.
Symmetry
Symmetry of the channel log-likelihood and of the variables appearing in density evolution is an attribute of a desired BMS channel, and symmetry is preserved by the BP operations during evolution. If we assume that the factor graph associated with an LDPC code is "tree-like", we can apply BP decoding with a symmetric random initial condition and note that the messages $h^{(t)}_{i\to a}$ and $u^{(t)}_{a\to i}$ are symmetric variables at all $t$. This observance of symmetry is analogous to the Nishimori condition in spin glasses and holds for the MAP log-likelihood of a bit as well.
Physical degradation
Let's first define physical degradation for BMS channels. Take two channels BMS(1) and BMS(2) denoted by transition matrices $\{Q_1(y_1 \mid x)\}$ and $\{Q_2(y_2 \mid x)\}$
and output alphabets $\mathcal{Y}_1$ and $\mathcal{Y}_2$. Then BMS(2) is physically degraded with respect to BMS(1) if there exists a third channel C with input alphabet $\mathcal{Y}_1$ such that BMS(2) is the concatenation of BMS(1) and C. If we represent the transition matrix of C as $\{R(y_2 \mid y_1)\}$, we can represent the above physical degradation as
$Q_2(y_2 \mid x) = \sum_{y_1 \in \mathcal{Y}_1} R(y_2 \mid y_1)\, Q_1(y_1 \mid x)$ (25)
This is analogous to a Markov chain $X \to Y_1 \to Y_2$, and physical degradation defines a partial ordering on channels. Channel reliability is then ordered by measures such as conditional entropy and bit error rate. This extends to symmetric random variables, which are associated with BMS channels.
Thresholds
We then fix a particular LDPC code and look at the BP messages as random variables, due to the randomness in the received vector $\underline{y}$, with regard to the proposition below, which shows that the bit error rate decreases monotonically with time:
Proposition: If is a tree, then for any . Analogously, if , then for any .
Density evolution in this manner gives a useful estimate of the distributions of the density evolution variables $h^{(t)}$ and $u^{(t)}$. Since the bit error rate $P_b^{(t)}$ and the conditional entropy are both monotonically decreasing functions of the number of iterations and, conversely, monotonically increasing functions of the noise level p, we can define the finite limit $P_b^{BP}(p) = \lim_{t\to\infty} P_b^{(t)}(p)$. The corresponding BP threshold can then be defined as $p_d = \sup\{p : P_b^{BP}(p) = 0\}$.
For $p > p_d$, however, increasing the number of iterations does not help, as the bit error rate is asymptotically lower bounded by $P_b^{BP}(p) > 0$ for a fixed number of iterations. Good LDPC codes are thus designed with a large BP threshold at a given design rate, maximizing the threshold noise level over degree distribution pairs. Such an optimized ensemble will have a finite fraction of variable nodes of degree 2 and a large number of codewords with small weight, which ultimately prevent the block error probability from vanishing as $N \to \infty$.
References
[1] Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford Graduate Texts, 2009.