Statistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated English

Ganesan, Ravi; Sherman, Alan T.

dc.contributor.author	Ganesan, Ravi
dc.contributor.author	Sherman, Alan T.
dc.date.accessioned	2019-02-21T16:00:26Z
dc.date.available	2019-02-21T16:00:26Z
dc.date.issued	2010-06-04
dc.description.abstract	Computer experiments compare the effectiveness of five test statistics at recognizing and distinguishing several types of real and simulated English strings. These experiments measure the statistical power and robustness of the test statistics X², ML, IND, S, and IC when applied to samples of everyday American English from the Brown Corpus and Wall Street Journal and to simulated English generated from 1st-order Markov models based on these samples. An empirical approach is needed because the asymptotic theory of statistical inference on Markov chains does not apply to short strings drawn from natural language. Here, X² is the chi-squared test statistic; ML is a likelihood ratio test for recognizing a known language; IND is a likelihood ratio test for distinguishing unknown 0th-order noise from unknown 1st-order language; S is a log-likelihood function that is a most-powerful test for distinguishing a known language from uniform noise; and IC is the index of coincidence. The test languages comprise four types of real English, two types of simulated 1st-order English, and three types of noise. Two experiments characterize the distributions of these test statistics when applied to nine test languages, presented as strings of different lengths and contaminated with various amounts of noise. Experiment 1 varies the length of the string from 2 to 2¹⁷ characters. Experiment 2 adds uniform noise to samples of three fixed lengths (2⁴, 2⁷, 2¹⁰), with the amount of added noise ranging from 0% to 100%. These experiments assess the performance of the test statistics under realistic cryptographic constraints. Using graphs and tables of observed statistical power, we compare the effectiveness of the test statistics at distinguishing various pairs of languages at several critical levels. Although no statistic dominated all others for all critical levels and string lengths, each test performed well at its designated task.
For distinguishing a known type of English from uniform noise at critical levels 0.1 through 0.0001, X² attained the highest power, with ML and S also performing well. For distinguishing uniform noise from a known type of English at the same critical levels, ML had the overall best performance, with IC, X², S, and IND also performing well. And for each of these tasks under noisy conditions, ML attained the highest power. In addition, through histograms we describe the actual distribution of each statistic on various language types. These detailed results, which show relationships between power, critical level, and string length, will help cryptanalysts and others apply statistical methods to practical language-recognition problems.	en_US
dc.description.uri	https://www.tandfonline.com/doi/pdf/10.1080/0161-119491882919?needAccess=true	en_US
dc.format.extent	21 pages	en_US
dc.genre	journal articles postprints	en_US
dc.identifier	doi:10.13016/m2y6pw-pf7r
dc.identifier.citation	Ravi Ganesan & Alan T. Sherman (1994) STATISTICAL TECHNIQUES FOR LANGUAGE RECOGNITION: AN EMPIRICAL STUDY USING REAL AND SIMULATED ENGLISH, CRYPTOLOGIA, 18:4, 289-331, DOI: 10.1080/0161-119491882919	en_US
dc.identifier.uri	http://hdl.handle.net/11603/12834
dc.language.iso	en_US	en_US
dc.publisher	Taylor & Francis
dc.relation.isAvailableAt	The University of Maryland, Baltimore County (UMBC)
dc.relation.ispartof	UMBC Center for Research and Exploration in Space Sciences & Technology II (CRSST II)
dc.relation.ispartof	UMBC Faculty Collection
dc.relation.ispartof	UMBC Computer Science and Electrical Engineering Department
dc.rights	This item is likely protected under Title 17 of the U.S. Copyright Law. Unless on a Creative Commons license, for uses protected by Copyright Law, contact the copyright holder or the author.
dc.rights	“This is an Accepted Manuscript of an article published by Taylor & Francis in CRYPTOLOGIA, on 04 Jun 2010, available online: http://www.tandfonline.com/10.1080/0161-119491882919”
dc.subject	automatic language recognition	en_US
dc.subject	statistical approaches to language recognition	en_US
dc.subject	cryptography	en_US
dc.subject	cryptanalysis	en_US
dc.title	Statistical Techniques For Language Recognition: An Empirical Study Using Real And Simulated English	en_US
dc.type	Text	en_US
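
Illustrative sketch (not part of the record or the paper): the abstract names the index of coincidence (IC) and the chi-squared statistic (X²) among its five test statistics but does not give their formulas. The Python below uses the common textbook definitions over single-letter counts, with uniform noise over A-Z as the reference distribution. That is an assumption made for brevity; the paper works with 1st-order Markov models, which this sketch does not reproduce. The function names index_of_coincidence and chi_squared_vs_uniform are hypothetical.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase  # assumption: 26-letter alphabet, case folded


def letters_only(text):
    """Return the letters A-Z found in text, upper-cased; drop everything else."""
    return [c for c in text.upper() if c in ALPHABET]


def index_of_coincidence(text):
    """Textbook IC: probability that two letters drawn without replacement match."""
    letters = letters_only(text)
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two letters")
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def chi_squared_vs_uniform(text):
    """Pearson chi-squared of single-letter counts against uniform noise over A-Z."""
    letters = letters_only(text)
    n = len(letters)
    if n == 0:
        raise ValueError("need at least one letter")
    expected = n / len(ALPHABET)  # same expected count for every letter
    counts = Counter(letters)
    return sum((counts.get(c, 0) - expected) ** 2 / expected for c in ALPHABET)
```

Under these definitions, uniform noise over 26 letters has an expected IC near 1/26 (about 0.038), while ordinary English text is usually quoted at roughly 0.066, so either statistic can serve as a rough English-versus-noise score on strings long enough for the counts to be meaningful.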

Files

Original bundle

Name: ShermanCryptologia94.pdf
Size: 1012.45 KB
Format: Adobe Portable Document Format

Collections

UMBC Center for Information Security and Assurance (CISA)
UMBC Computer Science and Electrical Engineering Department
UMBC Faculty Collection