MICHAL PTASZYNSKI*, PAWEL LEMPA**, FUMITO MASUI*

A MODULAR SYSTEM FOR SUPPORT OF EXPERIMENTS IN TEXT CLASSIFICATION
MODULARNY SYSTEM WSPOMAGANIA DOŚWIADCZEŃ Z DZIEDZINY KLASYFIKACJI TEKSTU

Abstract

This paper presents a modular system for the support of experiments and research in text classification. The research process usually requires two general kinds of abilities. The first is to laboriously analyse the provided data, perform experiments and, from the experiment results, create materials for a scientific paper, such as tables or graphs. The second kind of task includes, for example, providing a creative discussion of the results. To help researchers and allow them to focus more on creative tasks, we provide a system which performs the laborious part of research. The system prepares datasets for experiments, automatically performs the experiments and, from the results, calculates the scores of Precision, Recall, F-score, Accuracy, Specificity and the phi coefficient. It also creates tables in the LaTeX format containing all the results, and it draws graphs depicting and informatively comparing each group of results.

Keywords: Experiment support, Pattern extraction, Graph generation

* Ph.D. Michal Ptaszynski, Ph.D. Fumito Masui, Department of Computer Science, Kitami Institute of Technology.
** M.Sc. Pawel Lempa, Institute of Applied Informatics, Faculty of Mechanical Engineering, Cracow University of Technology.

1. Introduction

It is often said ironically about economists: "If you are so smart, why aren't you rich?" A similar remark can be made about researchers involved in natural language processing (NLP) or computational linguistics (CL): "If you have so many language analysis and generation techniques, why don't you use them to perform the research for you and generate a paper and presentation slides in the end?" Unfortunately, there has been astonishingly little research on scientific paper generation, presentation slide generation, or even on support of the research process. One of the reasons for this is the fact that many stages of the research process require creativity, for which effective computational models still do not exist. The parts of research which require such creative skills include, for example, preparing descriptions of the research background, the literature review and, especially, the discussion and detailed analysis of experiment results. However, apart from these creative parts, a wide range of activities involved in the process is of a different, non-creative nature. Preparing data for experiments, conducting the experiments, and step-by-step manual changing of feature sets to train and test machine learning classifiers are only some of the examples.
Moreover, the thorough calculation of final scores for the evaluated tools, the generation of tables describing experiment results in technical reports and scientific papers, the generation of graphs from those results and, finally, the description and analysis of the results do not require creative thinking. On the contrary, they are the non-creative part of the everyday research drill. Despite being non-creative, however, such activities are laborious, since they require most of the researcher's focus and precision. This can affect motivation toward research and, in practice, consumes time which could be used more efficiently for creative tasks, such as writing a detailed and convincing discussion of the results. To help researchers perform their work in a more convenient and efficient way, we developed a system for the support of research activities and the writing of technical reports and scientific papers. The system is released as an open-source set of libraries. After being initialized with one short command, the whole process, including the preparation of data for the experiment, conducting the experiment and generating materials helpful in writing a scientific paper, is conducted automatically.

The paper's outline is as follows. Firstly, we describe the background of our research and introduce a number of related studies. Secondly, we describe the whole system, presenting in detail each of the parts responsible for data preparation, experiment conduction and the generation of supporting materials. Next, we present the evaluation process, which verifies the practical usability of the system. Finally, we conclude the paper and propose other features we plan to implement in the near future.

2. Related Research

Research on supporting the research process itself is rare. The authors found only a few pieces of literature that could be considered related to the presented system.
One of the most usable and helpful environments developed so far is WEKA. WEKA provides a wide range of machine learning algorithms useful in data mining tasks. It can be used as stand-alone software or can be called from custom Java code to analyse data on the fly. WEKA allows data pre-processing, classification and clustering, and also provides simple visualizations of results. It is widely used in the research community, especially in the natural language processing (NLP) and computational linguistics (CL) fields. Unfortunately, WEKA needs specially prepared files with measurements in appropriate columns and cannot deal with plain unprocessed data (unprocessed collections of sentences, etc.). It also does not provide graphs in a format easily applicable in a research paper, nor does it provide natural language descriptions of the analysis of results.

Another tool with well-established renown is the Natural Language Toolkit (NLTK). NLTK is a Python-based platform allowing various experiments with human language data. It provides a number of tools for text classification, tokenization, stemming, tagging, parsing and semantic reasoning. A feature of NLTK worth mentioning is that it provides not only tools for text processing, but also a number of natural language corpora and lexical resources. A disadvantage of NLTK is that all the tools need to be launched separately. This requires at least minimal knowledge of programming languages, preferably Python, in which the toolkit was created; thus NLTK is a very powerful and useful tool for researchers with at least minimal programming knowledge. Another disadvantage is that not all tools included in NLTK are compatible with languages of non-alphabetic transcription (Japanese, Chinese, Korean, etc.).

In a different kind of research, Nanba et al. [1] focus on the automatic generation of literature reviews.
They assumed that in research papers researchers include short passages which can be considered a kind of summary describing the essence of a paper and the differences between the current and previous research. Their research was very promising, as Nanba et al. [1] dealt with the creative part of research. Unfortunately, after the initial paper, which presented interesting preliminary results, the method was not developed any further. This could suggest that the creative part of the research they attempted to support, namely the description of background and previous research, is still too difficult to perform fully automatically.

Shibata and Kurohashi [2] focused on a different task, namely automatically generating summary slides from texts. This is not exactly the same task as creating presentation slides from a scientific paper, which we consider one of our future tasks; however, the method they proposed could, after several modifications, be applied in our research as well. They generated slides by itemizing topic and non-topic parts extracted from syntactically analysed text. In our method the parts created by the system are grouped automatically, which could help in the itemization process.

Apart from the research described above, an interesting, although not quite scientific, experiment was performed by anonymous researchers involved in a campaign against dubious conferences (https://sites.google.com/site/dumpconf/). In their attempt they generated scientific papers by picking random parts of actual papers and submitted those fake papers to specific conferences to verify the review process of those questionable conferences. They succeeded in their task and were accepted to the conferences, which in general proved that the review process of some conferences is not of the highest quality.
Therefore, if there is a similar attempt in the future, although desirably a more ambitious one (non-random scientific paper generation), it should submit the artificially created papers to conferences or journals of proved and well-known reputation.

3. System Description

SPASS, or Scientific Paper Writing Support System, performs three tasks. Firstly, it prepares the data for the experiment; secondly, it conducts the experiment under the conditions selected by the user; and thirdly, it summarizes the results and prepares the materials for a technical report or a scientific paper. We describe each part of the process in the sections below. A general overview of the system is presented in Figure 1.

User Input

The system performs the laborious and non-creative tasks of the research process automatically, with one click. It was developed to help in text classification and analysis tasks. At present the system handles up to two datasets (binary classification), preferably of opposite features, such as positive and negative in sentiment analysis (SA), although the applicability of the system is not limited to SA. The user needs to prepare two separate files containing the sentences from the two corpora to be compared. The contents of these files are contrasted with each other in the process of automatic evaluation. If the input consists of only one corpus, the system simply produces the most frequent patterns for that corpus (in this paper we use the words "pattern" and "n-gram" interchangeably).

Dataset Pre-processing

The provided sentences can be in an unprocessed form. In that case the processed elements will consist of words (sentence tokens). However, SPASS allows any pre-processing of the sentence contents, thus making possible any kind of generalization the user might wish to apply. The experiments can be repeated with different kinds of pre-processing to check how the pre-processing influences the results.
Examples of pre-processing are presented in Table 1.

Table 1. Three examples of pre-processing of a sentence in Japanese; N = noun, TOP = topic marker, ADV = adverbial particle, ADJ = adjective, COP = copula, EXCL = exclamative mark

Sentence: 今日はなんて気持ちいい日なんだ!
Transliteration: Kyō wa nante kimochi ii hi nanda!
Meaning: Today TOP what pleasant day COP EXCL
Translation: What a pleasant day it is today!

Pre-processing examples:
1. Words: Kyō wa nante kimochi ii hi nanda !
2. POS: N TOP ADV N ADJ N COP EXCL
3. Words+POS: Kyō[N] wa[TOP] nante[ADV] kimochi[N] ii[ADJ] hi[N] nanda[COP] ![EXCL]

Fig. 1. General overview of SPASS, divided into research support and paper writing support parts

In these examples a sentence in Japanese is pre-processed in the three following ways:
1. Tokenization: all words, punctuation marks, etc. are separated by spaces;
2. Parts of speech (POS): words are replaced with their representative parts of speech;
3. Tokens with POS: both word and POS information are included in one element.

In theory, the more generalized a sentence is, the fewer unique patterns (n-grams) it will produce, but the produced patterns will be more frequent. This can be explained by comparing a tokenized sentence with its POS representation. For example, in the sentence from Table 1 we can see that the simple phrase "kimochi ii" ("feeling good/pleasant") can be represented by the POS pattern "N ADJ". We can easily assume that there will be more occurrences of "N ADJ" than of "kimochi ii", since many word combinations can be represented by this morphological pattern. In other terms, there are more words in the dictionary than POS labels. Therefore, POS patterns will come in less variety but with a higher occurrence frequency.
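The effect of generalization can be illustrated with a small counting sketch (an illustration only, not part of SPASS): two toy sentences that differ in wording but share the same POS structure produce two distinct word bigrams, but a single POS bigram with twice the frequency.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two toy sentences that differ in words but share the POS structure N ADJ.
sents_words = [["kimochi", "ii"], ["tenki", "ii"]]
sents_pos = [["N", "ADJ"], ["N", "ADJ"]]

word_counts = Counter(g for s in sents_words for g in ngrams(s, 2))
pos_counts = Counter(g for s in sents_pos for g in ngrams(s, 2))

# Two distinct word bigrams with frequency 1 each, but one POS bigram
# ("N", "ADJ") with frequency 2: less variety, higher frequency.
```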
By comparing the results of classification using different pre-processing methods, we can find out whether it is more effective to represent sentences in a more generalized or a more specific way.

Experiment Preparation Module

The initial phase in the system consists of the preparation of data for the experiments. In this phase the datasets are prepared for an n-fold cross-validation test. This setting assumes that the provided datasets are first divided into n parts. Next, n-1 parts are used for training and the remaining one for testing. This procedure is performed n times, so that every part is used in both training and testing. The number of folds in the n-fold cross-validation can be selected by the user with one simple parameter. For example, assuming the system is launched as

$ bash main.sh

the user can perform a 5-fold cross-validation by adding the parameter 5, like below.

$ bash main.sh 5

The default experiment setup is 10-fold cross-validation. Setting the parameter to 1 will perform a test in which the test data is the same as the training data. A special additional parameter is -loo, with which the test is performed under the leave-one-out (LOO) condition. In this setting all instances except one are used for training, and the one left out is used as the test data. The test is performed as many times as there are instances in the dataset. For example, a LOO cross-validation test on a set of 35 sentences will perform the test 35 times. To speed up the process of validation, all tests are performed in parallel.

Pattern List Generation Module

The next step consists of the generation of all patterns from both provided corpora. It is possible to extract patterns of all lengths. However, an informal maximum length of n-grams used in the literature is either 5-grams (applied in the Google N-gram Corpus, English version) or 7-grams (applied in the Google N-gram Corpus, Japanese version).
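The two mechanical steps described above, splitting the data into folds for cross-validation and extracting all n-grams up to a maximum length, can be sketched as follows (a minimal illustration with invented helper names, not the actual SPASS code):

```python
def cross_validation_folds(items, n):
    """Split items into n folds; yield (train, test) pairs, each fold
    serving once as the test set. n == len(items) gives leave-one-out."""
    folds = [items[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def all_ngrams(tokens, max_n):
    """Return every n-gram of length 1..max_n as a tuple of tokens."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# 5-fold split of 10 sentences: each test fold holds 2, each train set 8.
sentences = [f"sentence {i}" for i in range(10)]
for train, test in cross_validation_folds(sentences, 5):
    assert len(train) == 8 and len(test) == 2

tokens = "Kyō wa nante kimochi ii hi nanda !".split()
grams = all_ngrams(tokens, 6)   # all patterns up to 6-grams
```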
This length limit was set experimentally, with the assumption that longer n-grams do not yield sufficiently high frequencies. The difference between English and Japanese comes from the fact that Japanese sentences contain more grammatical particles, which means that an n-gram of the same length carries less meaning in Japanese. In our system we set the default to 6-grams, although this setting can be modified freely by the users.

Based on the above assumptions, the system automatically extracts frequent sentence patterns distinguishable for a corpus. Firstly, all possible n-grams are generated from all elements of a sentence. From all the generated patterns, only those which appear in a given corpus more than once are retained as frequent patterns for that corpus. Those appearing only once are considered not useful and rejected as pseudo-patterns. The occurrences of patterns O are used to calculate the pattern weight w_j. The normalized weight w_j is calculated, according to equation (1), from the ratio of all occurrences of a pattern in one corpus, O_pos, to the sum of all its occurrences in both corpora, O_pos + O_neg. The weight of each pattern is normalized to fit in the range from +1 (representing purely positive patterns) to -1 (representing purely negative patterns). The normalization is achieved by subtracting 0.5 from the initial ratio and multiplying the intermediate product by 2:

w_j = (O_pos / (O_pos + O_neg) - 0.5) * 2    (1)

The weight can be further modified in several ways. Two features are important in the weight calculation: a pattern is more representative for a corpus, firstly, the longer it is (length k) and, secondly, the more often it appears in the corpus (occurrence O). Thus, the weight can be modified by awarding length, or by awarding both length and occurrence.
The formulas for the modified pattern weight are presented in equation (2) for the length-awarded weight w_l, and in equation (3) for the length- and occurrence-awarded weight w_lo:

w_l = w_j * k    (2)

w_lo = w_j * k * O    (3)

The list of frequent patterns created in the process of pattern generation and extraction can also be further modified. When two collections of sentences of opposite features (such as "positive" vs. "negative") are compared, the generated list of patterns will contain patterns that appear uniquely on only one of the sides (e.g. uniquely positive patterns and uniquely negative patterns) or on both (ambiguous patterns). Therefore, the pattern list can be further modified by erasing either all ambiguous patterns, or only those ambiguous patterns which appear the same number of times on both sides (further called "zero patterns", as their normalized weight is equal to 0). All of the above situations represent separate conditions automatically verified in the process of evaluation in the text classification task using the generated pattern lists. With these settings there are over a dozen conditions, for each of which the n-fold cross-validation test is performed.

Text Classification Module

In the text classification experiment each analysed item (a sentence) is given a score. The score is calculated using the pattern list generated in the Experiment Preparation Module. There is a wide variety of algorithms applicable in text classification with which the calculation of scores can be performed. However, in the initial version of SPASS we used simple settings, which will be upgraded as the system develops. Specifically, the score of a sentence is calculated as the sum of the weights of all patterns matched in that sentence, as in equation (4):

score = Σ w_j,  (-1 ≤ w_j ≤ 1)    (4)

In the future we will increase the number of applied classification algorithms, including standard algorithms such as Neural Networks and Support Vector Machines.
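Equations (1) to (4) can be sketched directly in code (an illustration with assumed names, not the original implementation; plain substring search stands in for the system's pattern matching):

```python
def weight(o_pos, o_neg):
    # Equation (1): normalized weight in [-1, +1];
    # +1 = purely positive pattern, -1 = purely negative.
    return (o_pos / (o_pos + o_neg) - 0.5) * 2

def weight_length(w_j, k):
    return w_j * k            # equation (2): award pattern length k

def weight_length_occurrence(w_j, k, o):
    return w_j * k * o        # equation (3): award length and occurrence

def sentence_score(sentence, pattern_weights):
    # Equation (4): the score is the sum of the weights of all patterns
    # matched in the sentence (substring match used for simplicity).
    return sum(w for p, w in pattern_weights.items() if p in sentence)

assert weight(10, 0) == 1.0    # purely positive
assert weight(0, 10) == -1.0   # purely negative
assert weight(5, 5) == 0.0     # ambiguous "zero pattern"

# Toy pattern list with assumed weights; "kimochi ii" and "nante" match.
weights = {"kimochi ii": 0.8, "nante": 0.2, "tsumaranai": -0.9}
s = "Kyō wa nante kimochi ii hi nanda !"
assert abs(sentence_score(s, weights) - 1.0) < 1e-9
```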
The use of a simple algorithm in the initial version allowed us to thoroughly test the other parts of the system. The score calculated for each sentence is automatically evaluated using a sliding threshold window. For example, under the condition that all sentences scoring above a threshold of 0 are considered positive, a sentence with a score of 0.5 will be classified as positive. However, if the initial collection of sentences was biased toward one of the sides (e.g. more sentences of one kind, or longer sentences, etc.), there will be more patterns of a certain sort. Thus, to avoid bias in the results, instead of applying a rule of thumb, the threshold is automatically optimized and all settings are automatically verified to choose the best model.

Contingency Table Generation Module

After the scores are calculated for all sentences, the system calculates the contingency table. Depending on whether the sentence actually was positive or negative (all test sentences represent the Gold Standard), the score becomes