Statistical Techniques for Language Recognition: An Introduction and Guide for Cryptanalysts
Loading...
Author/Creator
Author/Creator ORCID
Date
2010-06-04
Type of Work
Department
Program
Citation of Original Publication
Ravi Ganesan & Alan T. Sherman (1993) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN INTRODUCTION AND GUIDE FOR CRYPTANALYSTS, CRYPTOLOGIA, 17:4, 321-366, DOI: 10.1080/0161-119391867980
Rights
This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
“This is an Accepted Manuscript of an article published by Taylor & Francis in Cryptologia on04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119391867980.”
“This is an Accepted Manuscript of an article published by Taylor & Francis in Cryptologia on04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119391867980.”
Subjects
automatic plaintext recognition
categorical data
chi-squared test statistic
computational linguistics
contingency tables
cryptanalysts
cryptography
document processing
hypothesis testing
index of coincidence
language recognition
likelihood ratio tests
markov models of language
maximum likelihood estimators
natural language processing
statistical inference
statistical pattern recognition
statistics of language
weight of evidence
categorical data
chi-squared test statistic
computational linguistics
contingency tables
cryptanalysts
cryptography
document processing
hypothesis testing
index of coincidence
language recognition
likelihood ratio tests
markov models of language
maximum likelihood estimators
natural language processing
statistical inference
statistical pattern recognition
statistics of language
weight of evidence
Abstract
We explain how to apply statistical techniques to solve several language-recognition problems that arise in cryptanalysis and other domains. Language recognition is important in cryptanalysis because, among other applications, an exhaustive key search of any cryptosystem from ciphertext alone requires a test that recognizes valid plaintext. Written for cryptanalysts, this guide should also be helpful to others as an introduction to statistical inference on Markov chains. Modeling language as a finite stationary Markov process, we adapt a statistical model of pattern recognition to language recognition. Within this framework we consider four well-defined language-recognition problems: 1) recognizing a known language, 2) distinguishing a known language from uniform noise, 3) distinguishing unknown 0th-order noise from unknown lst-order language, and 4) detecting non-uniform unknown language. For the second problem we give a most powerful test based on the Neyman-Pearson Lemma. For the other problems, which typically have no uniformly most powerful tests, we give likelihood ratio tests. We also discuss the chi-squared test statistic X 2 and the Index of Coincidence IC. In addition, we point out useful works in the statistics and pattern-matching literature for further reading about these fundamental problems and test statistics.