Wednesday, March 6, 2019

Part of Speech Recognizer

Improving Identifier Informativeness using Part of Speech Information

Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland
Baltimore MD 21210-2699, USA
{binkley, lawrie}@cs.loyola.edu, [email protected]

Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis

Abstract

Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags several rules emerge that seek to improve structure-field naming.

1 Introduction

Software engineering can benefit from leveraging tools and techniques of other disciplines. Traditionally, natural language processing (NLP) tools solve problems by processing the natural language found in documents such as news articles and web pages. One such NLP tool is a part-of-speech (POS) tagger. Tagging is, for example, crucial to Named-Entity Recognition [3], which enables information about a person to be tracked within and across documents. Many POS taggers are built using machine learning based on newswire training data. Conventional wisdom is that these taggers work well on newswire and similar artifacts; however, their effectiveness degrades as the input moves further away from the highly structured sentences found in traditional newswire articles.

The text available in source-code artifacts, in particular a program's identifiers, has a very different structure. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval results from software repositories [1, 11] and have also investigated the comprehensibility of source code identifiers [4, 6].

Fortunately, machine learning techniques are robust and, as reported in Section 2, good results are obtained using several sentence forming templates. This initial investigation also suggests rules specific to software that would improve tagging. For example, the type of a declared variable can be factored into its tags. As an example application of POS tagging for source code, the tagger is then used to tag over 145,000 structure-field names. Equivalence classes of tags are then examined to produce rules for the automatic identification of poor names (as described in Section 3) and suggest better names, which is left to future work.

Figure 1. Process for POS tagging of field names: Source Code -> Source Code Mark-up -> Extract Field Names -> Split Field Names -> Apply Template -> Part of Speech Tagging.

2 Part-of-Speech Tagging

Before a POS tagger's output can be used as input to downstream SE tools, the POS tagger itself needs to be vetted. This section describes an experiment performed to test the accuracy of POS tagging on field names mined from source code.
The process used for mining and tagging the fields is first described, followed by the empirical results from the experiment.

Figure 1 shows the pipeline used for the POS tagging of field names. On the left, the input to the pipeline is source code. This is then marked up using XML tags by srcML [5] to identify various syntactic categories. Third, field names are extracted from the marked-up source using XPath queries. Figure 2 shows the queries for C++ and Java.

Figure 2. XML queries for extracting C++ and Java fields from srcML.

The fourth stage splits field names by replacing underscores with spaces and inserting a space where the case changes from lowercase to uppercase. For example, the names spongeBob and sponge_bob become sponge bob. After splitting, all characters are shifted to lowercase. This stage also filters names so that only those that consist entirely of dictionary words are retained. Filtering uses Debian's American (6-2) dictionary package, which consists of the 98,569 words from Kevin Atkinson's SCOWL word lists that have size 10 through 50 [2]. This dictionary includes some common abbreviations, which are thus included in the final data set. Future work will obviate the need for filtering through vocabulary normalization in which non-words are split into their abbreviations and then expanded to their natural language equivalents [9].

The fifth stage applies a set of templates (described below) to each separated field name. Each template effectively wraps the words of the field name in an attempt to improve the performance of the POS tagger. Finally, POS tagging is performed by Version 1.6 of the Stanford Log-linear POS Tagger [12]. The default options are used including the pretrained bidirectional model [10].

The remainder of this section considers empirical results concerning the effectiveness of the tagging pipeline. A total of 145,163 field names were mined from 10,985 C++ files and 9,614 Java files found in 171 programs. From this full data set, 1500 names were randomly chosen as a test set (683 came from C++ files and 817 from Java files). A human assessor (a university student majoring in English) tagged the 1500 field names with POS information, producing the oracle set. This oracle set is used to evaluate the accuracy of automatic tagging techniques when applied to the test set.

Preliminary study of the Stanford tagger indicates that it needed guidance when tagging field names. Following the work of Abebe and Tonella [1], four templates were used to provide this guidance. Each template includes a slot into which the split field name is inserted. Their accuracy is then evaluated using the oracle set.

The four templates are the Sentence Template, the List Item Template, the Verb Template, and the Noun Template. The Sentence Template, the simplest of the four, considers the identifier itself to be a sentence by appending a period to the split field. The List Item Template exploits the tagger having learned about POS information found in the sentence fragments used in lists. The Verb Template tries to encourage the tagger to treat the field name as a verb or a verb phrase by prefixing it with "Please," since usually a command follows. Finally, the Noun Template tries to encourage the tagger to treat the field as a noun by postfixing it with "is a thing", as was done by Abebe and Tonella [1].

Table 1 shows the accuracy of using each template applied to the test set with the output compared to the oracle. The major diagonal represents each technique in isolation while the remaining entries require two techniques to agree, thus lowering the percentage. The similarity of the percentages in a column gives an indication of how similar the set of correctly tagged names is for two techniques.
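The splitting, filtering, and template stages described earlier in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the tiny word set stands in for the Debian dictionary package, and the exact text of the List Item wrapper is a guess, since the source describes that template only in words.

```python
import re

# Hypothetical stand-in for the dictionary filter; the paper uses Debian's
# American dictionary package, which is not bundled here.
SAMPLE_DICTIONARY = {"sponge", "bob", "delete", "button"}


def split_field_name(name):
    """Stage four: underscores become spaces, a space is inserted at each
    lowercase-to-uppercase change, and the result is lowercased."""
    spaced = name.replace("_", " ")
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", spaced)
    return spaced.lower().split()


def is_all_dictionary_words(words, dictionary):
    """Filter step: keep a name only if every word is a dictionary word."""
    return bool(words) and all(word in dictionary for word in words)


def apply_templates(words):
    """Stage five: wrap the split name in each of the four templates."""
    phrase = " ".join(words)
    return {
        "sentence": f"{phrase} .",          # append a period
        "list_item": f"- {phrase}",         # ASSUMED list-item form
        "verb": f"Please, {phrase} .",      # prefix with "Please,"
        "noun": f"{phrase} is a thing .",   # postfix with "is a thing"
    }


if __name__ == "__main__":
    for raw in ("spongeBob", "sponge_bob", "xq_tmp"):
        words = split_field_name(raw)
        if is_all_dictionary_words(words, SAMPLE_DICTIONARY):
            print(raw, apply_templates(words))
```

Each wrapped string would then be passed to the POS tagger, and only the tags for the words in the slot would be kept. That final step is omitted here because it depends on the tagger's own interface.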
For example, considering the Sentence Template in Table 1, the Verb Template has the lowest overlap of the remaining three, as indicated by its joint percentage of 71.7%. Overall, the List Item Template performs the best, and the Sentence Template and Noun Template produce essentially identical results, getting the correct tagging on nearly all the same fields. Perhaps unsurprisingly, the Verb Template performs the worst. Nonetheless, it is interesting that this template does produce the correct output on 3.2% of the fields where no other template succeeds. As shown in Table 2, overall at least one template correctly tagged 88% of the test set. This suggests that it may be possible to combine these results, possibly using machine learning, to produce higher accuracy than achieved using the individual templates. Although 88% is lower than the 97% achieved by natural language taggers on the newswire data, the accuracy is still quite high considering the lack of context provided by the words of a single structure field.

            Sentence   List Item   Verb     Noun
Sentence    79.1%      76.5%       71.7%    77.0%
List Item   76.5%      81.7%       71.0%    76.0%
Verb        71.7%      71.0%       76.0%    70.8%
Noun        77.0%      76.0%       70.8%    78.7%

Table 1. Each percentage is the percent of correctly tagged field names using both the row and column technique; thus the major diagonal represents each technique independently.

Correct in all templates            68.9%
Correct in at least one template    88.0%

Table 2. Correctly tagged identifiers

As illustrated in the next section, the identification is sufficiently accurate for use by downstream consumer applications.

3 Rules for Improving Field Names

As an example application of POS tagging for source code, the 145,163 field names of the full data set were tagged using the List Item Template, which showed the best performance in Table 1. The resulting tags were then used to form equivalence classes of field names. Analysis of these classes led to four rules for improving the names of structure fields. Rule violations can be automatically identified using POS tagging. Further, as illustrated in the examples, by mining the source code it is possible to suggest potential replacements. The assumption behind each rule is that high quality field names will provide better conceptual information, which aids an engineer in the task of forming a mental understanding of the code. Correct part-of-speech information can help inform the naming of identifiers, a process that is essential in communicating intent to future programmers.

Each rule is first informally introduced and then formalized. After each rule, the percentage of fields that violate the rule is given. Finally, some rules are followed by a discussion of rule exceptions or related notions.

The first rule observes that field names represent objects, not actions; thus they should avoid present-tense verbs. For example, the field name "create mp4" clearly implies an action, which is unlikely the intent (unless perhaps the field represents a function pointer). Inspection of the source code reveals that this field holds the desired mp4 video stream container type. Based on the context of its use, a better, less ambiguous name for this identifier is "created mp4 container type", which includes the past-tense verb created. A notable exception to this is fields of type boolean, for example "is logged in", where the present tense of the verb to be is used. A present tense verb in this context is used to represent a current state, and is therefore not confusing.

Rule 1: Non-boolean field names should never contain a present tense verb.
Violations detected: 27,743 (19.1% of field names)

Looking at the violations of Rule 1, one pattern that emerges suggests an improvement to the POS tagger that would better specialize it to source code. A pattern that frequently occurs in GUI programming finds verbs used as adjectives when describing GUI elements.
Recognizing such fields based on their type should improve tagger accuracy. Consider the fields "delete button" and, to a lesser extent, "continue box". In isolation these appear to represent actions. However, they actually represent GUI elements. Thus, a special context-sensitive case in the POS tagger would tag such verbs as adjectives.

The second rule considers field names that contain only a verb, for example the field name "recycle". This name communicates little to a programmer unfamiliar with the code. Examination of the source code reveals that this variable is an integer and, based on the comments, it counts the number of things recycled. While this content can be inferred from the declaration and the comments surrounding it, field name uses often occur far from their declaration, reducing the value of the declared type and supporting comments. A potential fix in this case is to change the name to "recycled count" or "things recycled". Both alternatives improve the clarity of the name.

Rule 2: Field names should never be only a verb.
Violations detected: 4,661 (3.2% of field names)

The third rule considers field names that contain only an adjective. While adjectives are useful when used with a noun, an adjective alone relies too much on the type of the variable to fully explain its use. For example, consider the identifier "interesting". In this case, the declared type of list provides the insight that this field holds a list of interesting items. Replacing this field with "interesting list" or "interesting items" should improve code understanding.

Rule 3: Field names should never be only an adjective.
Violations detected: 5,487 (3.8% of field names)

An interesting exception to this rule occurs with data structures where the field name has an established conventional meaning. For example, when naming the next node in a linked list, "next" is commonly accepted. Other similar common names include "previous" and "current".

The final rule deals with field names for booleans.
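Rules 1 through 3 depend only on a name's tag sequence, so they lend themselves to a simple automated check. The following Python sketch is a hedged illustration rather than the authors' formalization: it assumes Penn Treebank tags (the tag set used by the Stanford tagger's English models), it assumes that VB, VBP, and VBZ together count as "present tense", and the function name and the `is_boolean` flag are hypothetical.

```python
# Penn Treebank verb and adjective tags.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}
# ASSUMPTION: base form (VB) and the two present forms count as present tense.
PRESENT_VERB_TAGS = {"VB", "VBP", "VBZ"}


def violated_rules(tags, is_boolean=False):
    """Return the numbers of Rules 1 through 3 that a tag sequence violates.

    `tags` holds one POS tag per word of the split field name;
    `is_boolean` reflects the declared type, which exempts a field from Rule 1.
    """
    violations = []
    # Rule 1: non-boolean names should not contain a present tense verb.
    if not is_boolean and any(tag in PRESENT_VERB_TAGS for tag in tags):
        violations.append(1)
    # Rule 2: a name should not consist only of verbs.
    if tags and all(tag in VERB_TAGS for tag in tags):
        violations.append(2)
    # Rule 3: a name should not consist only of adjectives.
    if tags and all(tag in ADJECTIVE_TAGS for tag in tags):
        violations.append(3)
    return violations


if __name__ == "__main__":
    print(violated_rules(["VB", "NN"]))               # e.g. "create mp4"
    print(violated_rules(["VBN", "NN", "NN", "NN"]))  # e.g. "created mp4 container type"
    print(violated_rules(["JJ"]))                     # e.g. "interesting"
```

The tag sequences in the usage lines are illustrative inputs, not output from the tagger. A single bare verb such as "recycle" would trip both Rule 1 and Rule 2 under these assumptions, so a real tool would need to decide how to report overlapping violations.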
Boolean variables represent a state that is or is not, and this notion needs to be obvious in the name. The identifier "deleted" offers a good example. By itself there is no way to know for sure what is being represented. Is this a pointer to a deleted thing? Is it a count of deleted things? Source code inspection reveals that such boolean variables tend to represent whether or not something is deleted. Thus potential improved names include "is deleted" or "was deleted".

Rule 4: Boolean field names should contain third person forms of the verb to be or the auxiliary verb should (is, was, or should).
Violations detected: 5,487 (3.8% of field names)

Simply adding is or was to booleans does not guarantee a fix to the problem. For example, consider a boolean variable that indicates whether something should be allocated in a program. In this case, the boolean captures whether some event should take place in the future. In this example an appropriate temporal sense is missing from the name. A name like "allocated" does not provide enough information, and naming it "is allocated" does not make logical sense in the context of the program. A solution to this naming problem is to change the identifier to "should be allocated", which includes the necessary temporal sense communicating that this boolean is a flag for something expected to happen in the future.

4 Related Work

This section briefly reviews three projects that use POS information. Each uses an off-the-shelf POS tagger or lookup table. First, Høst et al. study naming of Java methods using a lookup table to assign POS tags [7]. Their aim is to find what they call naming bugs by checking to see if the method's implementation is properly indicated with the name of the method. Second, Abebe and Tonella study class, method, and attribute names using a POS tagger based on a modification of minipar to formulate domain concepts [1]. Nouns in the identifiers are examined to form ontological relations between concepts. Based on a case study, their approach improved concept searching. Finally, Shepherd et al. considered finding concepts in code using natural language information [11]. The resulting Find-Concept tool locates action-oriented concerns more effectively than the other tools and with less user effort. This is made possible by POS information applied to source code.

5 Summary

This paper presents the results of an experiment into the accuracy of the Stanford Log-linear POS Tagger applied to field names. The best template, List Item, has an accuracy of 81.7%. If an optimal combination of the four templates were used, the accuracy rises to 88%. These POS tags were then used to develop field name formation rules that 28.9% of the identifiers violated. Thus the tagging can be used to support improved naming. Looking forward, two avenues of future work include automating this improvement and enhancing POS tagging for source code. For the first, the source code would be mined for related terms to be used in suggested improved names. The second would explore training a POS tagger using, for example, the machine learning technique domain adaptation [8], which emphasizes the text in the training that is most similar to identifiers to produce a POS tagger for identifiers.

6 Acknowledgments

Special thanks to Mike Collard for his help with srcML and the XPath queries and Phil Hearn for his help with creating the oracle set. Support for this work was provided by NSF grant CCF 0916081.

References

[1] S. L. Abebe and P. Tonella. Natural language parsing of program element names for concept extraction. In 18th IEEE International Conference on Program Comprehension. IEEE, 2010.
[2] K. Atkinson. Spell checking oriented word lists (SCOWL).
[3] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[4] B. Caprile and P. Tonella. Restructuring program identifier names. In ICSM, 2000.
[5] M. L. Collard, H. H. Kagdi, and J. I. Maletic. An XML-based lightweight C++ fact extractor. Program Comprehension, 2003. 11th IEEE International Workshop on, pages 134-143, 2003.
[6] E. Høst and B. Østvold. The programmer's lexicon, volume I: The verbs. In International Working Conference on Source Code Analysis and Manipulation, Beijing, China, September 2008.
[7] E. W. Høst and B. M. Østvold. Debugging method names. In ECOOP 09. Springer Berlin / Heidelberg, 2009.
[8] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL 2007, 2007.
[9] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In Proceedings of the 17th Working Conference on Reverse Engineering, 2010.
[10] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In ACL 07. ACL, June 2007.
[11] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD 07. ACM, March 2007.
[12] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003, 2003.
