Statistical Unigram Analysis for Source Code Repository

dc.contributor.author: Xu, Weifeng
dc.contributor.author: Xu, Dianxiang
dc.contributor.author: Ariss, Omar El
dc.contributor.author: Liu, Yunkai
dc.contributor.author: Alatawi, Abdularaham
dc.contributor.department: School of Criminal Justice [en_US]
dc.contributor.program: Computer Science [en_US]
dc.date.accessioned: 2020-05-26T14:47:27Z
dc.date.available: 2020-05-26T14:47:27Z
dc.description.abstract: Unigram is a fundamental element of n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that the unigrams collected from source code repositories are essential resources for solving domain-specific problems. [en_US]
dc.description.uri: https://www.bowiestate.edu/files/resources/abdul-alatawi-research-paper-bigmm2017.pdf [en_US]
dc.format.extent: pages 8 [en_US]
dc.genre: conference paper [en_US]
dc.identifier: doi:10.13016/m2d58q-xtmy
dc.identifier.uri: DOI 10.1109/BigMM.2017.13
dc.identifier.uri: http://hdl.handle.net/11603/18741
dc.language.iso: en [en_US]
dc.relation.isAvailableAt: University of Baltimore
dc.subject: programming language [en_US]
dc.subject: Source Code [en_US]
dc.subject: n-gram [en_US]
dc.subject: unigram [en_US]
dc.subject: abbreviations [en_US]
dc.title: Statistical Unigram Analysis for Source Code Repository [en_US]
dc.title.alternative: 2017 IEEE Third International Conference on Multimedia Big Data [en_US]
dc.type: Event [en_US]

Files

Original bundle
Name: abdul-alatawi-research-paper-bigmm2017.pdf
Size: 322.63 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission