Statistical Techniques for Language Recognition: An Introduction and Guide for Cryptanalysts
Type of Work: 36 pages; journal articles postprints
Citation of Original Publication: Ravi Ganesan & Alan T. Sherman (1993) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN INTRODUCTION AND GUIDE FOR CRYPTANALYSTS, CRYPTOLOGIA, 17:4, 321-366, DOI: 10.1080/0161-119391867980
Rights: This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
“This is an Accepted Manuscript of an article published by Taylor & Francis in Cryptologia on 04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119391867980.”
Subjects:
automatic plaintext recognition
chi-squared test statistic
index of coincidence
likelihood ratio tests
Markov models of language
maximum likelihood estimators
natural language processing
statistical pattern recognition
statistics of language
weight of evidence
We explain how to apply statistical techniques to solve several language-recognition problems that arise in cryptanalysis and other domains. Language recognition is important in cryptanalysis because, among other applications, an exhaustive key search of any cryptosystem from ciphertext alone requires a test that recognizes valid plaintext. Written for cryptanalysts, this guide should also be helpful to others as an introduction to statistical inference on Markov chains. Modeling language as a finite stationary Markov process, we adapt a statistical model of pattern recognition to language recognition. Within this framework we consider four well-defined language-recognition problems: 1) recognizing a known language, 2) distinguishing a known language from uniform noise, 3) distinguishing unknown 0th-order noise from unknown 1st-order language, and 4) detecting non-uniform unknown language. For the second problem we give a most powerful test based on the Neyman-Pearson Lemma. For the other problems, which typically have no uniformly most powerful tests, we give likelihood ratio tests. We also discuss the chi-squared test statistic X² and the Index of Coincidence IC. In addition, we point out useful works in the statistics and pattern-matching literature for further reading about these fundamental problems and test statistics.
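As a rough illustration of two of the test statistics named in the abstract, the following Python sketch computes an Index of Coincidence and a chi-squared statistic against a uniform letter distribution. It is not taken from the article: the 26-letter lowercase alphabet, the function names, and the choice of the unnormalized form IC = Σ nᵢ(nᵢ − 1) / (N(N − 1)) are all assumptions made here for illustration, and the article's own definitions (for example, for higher-order Markov models) may differ.

```python
from collections import Counter
import string

# Assumption: a 26-letter lowercase alphabet; all other characters are ignored.
ALPHABET = string.ascii_lowercase


def letter_counts(text):
    """Return (Counter of letters, total number of letters) for the text."""
    letters = [c for c in text.lower() if c in ALPHABET]
    return Counter(letters), len(letters)


def index_of_coincidence(text):
    """Probability that two distinct positions in the text hold the same letter."""
    counts, n = letter_counts(text)
    if n < 2:
        raise ValueError("need at least two letters")
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def chi_squared_uniform(text):
    """Chi-squared statistic of observed letter counts against uniform noise."""
    counts, n = letter_counts(text)
    if n == 0:
        raise ValueError("need at least one letter")
    expected = n / len(ALPHABET)  # expected count per letter under uniformity
    return sum((counts.get(c, 0) - expected) ** 2 / expected for c in ALPHABET)
```

Under these assumptions, text drawn uniformly at random would tend to give an IC near 1/26 and a small chi-squared value, while natural-language text would tend to give larger values of both. A decision threshold would have to be chosen separately, which this sketch does not attempt.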
Showing items related by title, author, creator and subject.
Statistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated English. Ganesan, Ravi; Sherman, Alan T. (Taylor & Francis, 2010-06-04) Computer experiments compare the effectiveness of five test statistics at recognizing and distinguishing several types of real and simulated English strings. These experiments measure the statistical power and robustness ...
Matuszek, Cynthia; Herbst, Evan; Zettlemoyer, Luke; Fox, Dieter (Springer Nature Switzerland AG., 2012-06) As robots become more ubiquitous and capable of performing complex tasks, the importance of enabling untrained users to interact with them has increased. In response, unconstrained natural-language interaction with robots ...
Han, Lushan; Finin, Tim; McNamee, Paul; Joshi, Anupam; Yesha, Yelena (IEEE, 2013-06-01) Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, ...