
Information theory is one of the youngest branches of applied probability theory; it is not yet ten years old. The date of its birth can, with certainty, be considered to be the appearance in 1947-1948 of the by now classical work of Claude Shannon. Rarely does it happen in mathematics that a new discipline achieves the character of a mature and developed scientific theory in the first investigation devoted to it. Such in its time was the case with the theory of integral equations, after the fundamental work of Fredholm; so it was with information theory after the work of Shannon.
From the very beginning, information theory presents mathematics with a whole new set of problems, including some very difficult ones. It is quite natural that Shannon and his first disciples, whose basic goal was to obtain practical results, were not able to pay enough attention to these mathematical difficulties at the beginning. Consequently, at many points of their investigations, they were compelled either to be satisfied with reasoning of an inconclusive nature or to limit artificially the set of objects studied (sources, channels, codes, etc.) in order to simplify the proofs. Thus, the whole mass of literature of the first years of information theory, of necessity, bears the imprint of mathematical incompleteness which, in particular, makes it extremely difficult for mathematicians to become acquainted with this new subject. The recently published general textbook on information theory by S. Goldman can serve as a typical example of the style prevalent in this literature.
Investigations with the aim of setting information theory on a solid mathematical basis have begun to appear only in recent years and, at the present time, are few in number. First of all, we must mention the work of McMillan, in which the fundamental concepts of the theory of discrete sources (source, channel, code, etc.) were first given precise mathematical definitions. The most important result of this work must be considered to be the proof of the remarkable theorem that any discrete ergodic source has the property which Shannon attributed to sources of Markov type and which underlies almost all the asymptotic calculations of information theory. This circumstance permits the whole theory of discrete information to be constructed without being limited, as was Shannon, to Markov type sources. In the rest of his paper McMillan tries to put Shannon's fundamental theorem on channels with noise on a rigorous basis. In the course of this attempt, it becomes apparent that the sketchy proof given by Shannon contains gaps which remain even in the case of Markov sources. The elimination of these gaps is begun in McMillan's paper, but is not completed.
Next, it is necessary to mention the work of Feinstein. Like McMillan, Feinstein considers the Shannon theorem on channels with noise to be the pinnacle of the general theory of discrete information, and he undertakes to give a mathematically rigorous proof of this theorem. Accepting completely McMillan's mathematical apparatus, he avoids following Shannon's original path and constructs a proof using the completely new and apparently very fruitful idea of a "distinguishable set of sequences", the principal features of which will be explained below. However, Feinstein carries out the proof in all details only for the simplest and least practical case, where the successive signals of the source are mutually independent and the channel memory is zero. In the more general case, he indicates only sketchily how the reader is to carry out the necessary reasoning independently. Unfortunately, there remains a whole series of significant difficulties.
As is well known, Shannon formulated his theorem on channels with noise in two different ways. One was in terms of a quantity called equivocation, and the other was in terms of the probability of error. McMillan's analysis leads to the conclusion that these two formulations are not equivalent, and that the second gives a more exact result than the first. Feinstein's more detailed investigation showed that although the first formulation is implied by the second, a rigorous derivation of this implication is not only nontrivial but fraught with considerable additional difficulties. Since both formulations are equally important in actual content, it is preferable to speak about two Shannon theorems rather than combine them under the same heading.
In this paper I attempt to give a complete, detailed proof of both of these Shannon theorems, assuming any ergodic source and any stationary channel with a finite memory. At the present time, apparently, these are the broadest hypotheses under which the Shannon theorems can be regarded as valid. On the whole, I follow the path indicated in the works of McMillan and Feinstein, deviating from them only in the comparatively few cases when I see a gap in their explanation, or when another explanation seems to me more complete and convincing (and sometimes, more simple).
The first chapter of the paper, which is of purely auxiliary character, requires special explanation. It is devoted to the derivation of a whole set of unrelated inequalities, each of which is a theorem of elementary probability theory (i.e., pertains only to finite spaces). The reader acquainted with my paper "The entropy concept in probability theory" (Russian) (1953) will be able to begin this paper with the second chapter, returning to the first chapter only when references to its results appear in the text. All the following chapters are constructed according to a specific plan, and cannot be skipped or read in a different order.
The reader will see that the path to the Shannon theorems is long and thorny, but apparently science, at this time, knows no shorter path if we do not want artificial restrictions on the material studied and if we are to avoid making statements which we cannot prove.