On Text Similarity and the Development of a Language Statistics Server B J Ö R N A N D R I S T - PDF

Description
On Text Similarity and the Development of a Language Statistics Server B J Ö R N A N D R I S T Master of Science Thesis Stockholm, Sweden 2006 On Text Similarity and the Development of a Language Statistics

Please download to get full document.

View again

of 26
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information
Category:

Beauty

Publish on:

Views: 14 | Pages: 26

Extension: PDF | Download: 0

Share
Transcript
On Text Similarity and the Development of a Language Statistics Server B J Ö R N A N D R I S T Master of Science Thesis Stockholm, Sweden 2006 On Text Similarity and the Development of a Language Statistics Server B J Ö R N A N D R I S T Master s Thesis in Computer Science (20 credits) at the IT Program Royal Institute of Technology year 2006 Supervisor at CSC was Martin Hassel Examiner was Stefan Arnborg TRITA-CSC-E 2006:092 ISRN-KTH/CSC/E--06/092--SE ISSN Royal Institute of Technology School of Computer Science and Communication KTH CSC SE Stockholm, Sweden URL: On text similarity and the development of a language statistics server Abstract This thesis describes the design and implementation of StatServ, a language statistics server, as well as TextSim, a system for determining the similarity between texts. Further, this thesis shows the results of a comparison between two various configurations of TextSim; one with and one without linguistic analysis. The aim of the language statistics server is to provide a common service for other NLP applications that could benefit from corpus statistics. The architecture of the server is modular and hence new functionality could be added to the server in the future. Statistics can be added to the server by passing the server new documents to be analyzed. To assure the reliability of the statistics stored on the server, a text should not be allowed to affect the statistics more than once. In order to detect similar documents sent to the server the text similarity system was implemented and integrated into the server. TextSim was implemented using a Machine Learning approach and can be trained on example data. To evaluate and compare the two models of TextSim we used two sets of examples: a set of examples generated automatically from programs and a set of examples acquired from two assessors. Depending on the type of documents, we found the model using linguistic analysis to perform equally well or better than the model not using linguistic analysis. Likhet mellan texter samt utveckling av en språkstatistikserver Sammanfattning Denna rapport beskriver utformningen och implementationen av StatServ, en språkstatistikserver, och TextSim, ett system för att avgöra likheten mellan texter. Vidare redogör denna uppsats för resultaten av en jämförelse mellan två olika konfigurationer av TextSim; en med och en utan lingvisktisk analys. Syftet med språkstatistikservern är att tillhandahålla tjänster gemensamma för andra tillämpningar inom språkteknologi som kan nyttja korpusstatistik. Arkitekturen av servern är modulär varför ny funktionallitet kan läggas till i efterhand. Statistik kan adderas till servern genom att skicka nya dokument för analys. För att säkerställa validiteten av statistiken som lagras på servern får en text inte påverka statistiken mer än en gång. I syfte att detektera liknande dokument som skickas till servern implementerades textlikhetssystemet som därefter integrerades i servern. TextSim implementerades med en maskininlärningsmetod och kan tränas på exempeldata. För att jämföra de två konfigurationerna av TextSim användes två exempelmängder: en mängd automatiskt genererad från program och en mängd insamlad från två försökspersoner. Den konfiguration som använde lingvistisk analys visade lika bra eller bättre förmåga att uppskatta likheten mellan texter beroende på texternas typ. Contents 1 Introduction Background Problem Aim Thesis outline Method survey Identification of similarities in texts Developing a language statistics server Theoretical background Machine learning The perceptron Training and evaluation Resemblance and containment A method for unbiased estimation Producing the sketches Fingerprinting Rabin fingerprinting scheme Fingerprint collisions Implementing a text similarity system Introduction Similarity between documents Features for document comparison A naive approach An ML approach A perceptron model Training Implementation Sketch parameters Counting the duplicated shingles Complexity and performance Evaluation of the text similarity system Introduction Hypothesis Method overview System configurations 5.3 Data sets Reference documents Document variants Example types The generated example set The assessor example set Performance Performance on generated examples Performance on the assessor example set Discussion Architecture and design System overview Behavioral view Layered architecture Interaction layer The Java servlet technology Choice of communication protocol The client request Generating XML Application layer Persistence layer Connection pooling Indexing Stored procedures Denormalization Utility packages Caching XML Design patterns Controller Concrete factory Value Object Facade Data Access Object Strategy Modules Extending the server with new modules A module for lemmas and tags Discussion Summary Future work The language statistics server Text similarity Acknowledgements 37 References 39 Chapter 1 Introduction 1.1 Background Language statistics is of great importance in Natural Language Processing (NLP) as a substitute or complement to rule based methods. In our case, language statistics will refer to statistics calculated on written language. The access to digital documents of natural language increases daily. These documents are all potential subjects for linguistic applications using statistical methods. The field of applications grows rapidly. Statistical NLP methods can be used in application methods, acquisition methods, and evaluation methods [Nivre 02]. Here follows a few examples of frequently used types of statistics within NLP: Term frequency (tf) the number of occurrences of a term in a document or a collection of documents. Inverse document frequency (idf) a measure of how rare a term is in a collection of documents. Term weighting weights the term in a document according to its relevance to some certain criteria. A well known implementation is the tf idf weighting scheme [Jurafsky & Martin 00]. Such types of statistics can be used in applications such as word class disambiguation [Church 88], automatic summarization [Lin & Hovy 97], machine translation, literary detective work such as authorship attribution [Oakes 98] etc. Language statistics can be based on data in a single document or on a corpus (a large and structured collection of documents). In this thesis the former will be referred to as local statistics and the latter reference statistics. Some types of statistics require both local and reference statistics. E.g. the term weighting scheme, tf idf weighting, mentioned above requires tf which can be based on local statistics and idf which is based on reference statistics. Methods to calculate local statistics can be implemented as a part of statistical NLP systems. When using reference statistics in a certain application, each instance of that system needs to hold a copy of the reference statistics. Many NLP systems are based on similar types of statistics, therefore a common language statistics server could preferably provide the systems with statistics and thus facilitate the development of the systems. A set of requirements would need to be fulfilled for a language statistics server to be useful. Here follows the most important functional and non-functional requirements of such a server: 1 CHAPTER 1. INTRODUCTION 1. Parallel request handling. Incoming requests to the server must be handled concurrently by separate processes or threads. 2. Modularity. New statistical modules responsible for providing a certain type of statistics should be able to extend the server functionality. 3. Handling of local and reference statistics. Each module should be able to provide local and reference statistics relevant for a certain document passed to the server. 4. Handling of different corpora. Reference statistics are calculated on documents from one or many sets of documents, i.e. corpora. A request for reference statistics must include references to the document sets, on which the statistics should be calculated. Further, it should be possible to add and remove document sets. 5. Standardized protocol for request and response handling. The protocol for communicating with the server should be based on both standardized and platform independent techniques. 6. Performance. Response time is a critical factor especially when adding and fetching reference statistics. Therefore, response times should be kept at a minimum. 7. Automatic detection of similar documents. In order to assure the reliability of the references statistics, a text must not affect the reference statistics more than once. This requires an efficient and automated mechanism for identifying similar texts. All incoming texts are written in Swedish 1. The requirements listed above concern many interesting and profound aspects and are subjects for further research. The first six requirements have been fulfilled by implementing conventional techniques for building web server applications. It is however the last requirement concerning text similarity, that this project has aimed to investigate. Methods for determining the similarities of documents can be found in the field of Information Retrieval, IR. Two well known applications are: SCAM: a copy detection mechanism for digital libraries [Shivakumar et al. 95] the identification of nearly identical documents used by search engines in the purpose of returning only one variant of similar texts. [Broder 97] In this Master s project two mathematical notions called resemblance and containment [Broder 97] have been examined for determining the similarity of documents. Sketches of documents, created by hashing sequences of strings, are compared to determine the resemblance and the containment. This technique originates from the work by Andrei Broder [Broder 93] [Broder 97] and can be applied to different types of data such as MIDI documents [Mitzenmacher & Owen 01]. Our language statistics server is restricted to documents written in Swedish. Therefore we can make use of linguistic analysis in an attempt to improve the technique using resemblance and containment. A grammar checker for Swedish called Granska [Domeij et al. 99] can been used for linguistic analysis of documents. Granska can, among other things, be utilized for tokenizing, lemmatizing and part-of-speech tagging. 1 Theoretically, our method could be used for any language, assuming that an underlying analyzation system is used for the language in question. 2 1.2. PROBLEM The result from tokenizing a document is the list of tokens (the words and delimiters etc) contained in the document. While the result from lemmatizing a document is a list of lemmas. A lemma is the canonical form of a token. E.g. lemmatizing transforms all verb tokens to their infinitive form, noun tokens to their singular form etc. Part-of-speech tagging (also known as PoS tagging) is the process of determining the part-of-speech tag or PoS tag for a certain token in a sentence. A part-of-speech tag is a lexical category. Table 1.1 gives an example of a sentence transformed into tokens, lemmas, and part-of-speech tags. Original text: Han springer fortare! Tokens: Han springer fortare! Lemmas: han springa fort! Part-of-speech tags: pn vb ab mad Table 1.1. A sentence transformed into tokens, lemmas and part-of-speech tags. (pn, vb, ab and mad are PoS tags as used in Granska.) We aim to develop a system to be capable of mirroring the human perception of similarity between documents. Systems which are able to learn a concept, in our case the concept of similarity between documents, can preferably be implemented using machine learning (ML) algorithms [Mitchell 97]. A well established type of ML algorithms is the Artificial Neural Network (ANN) algorithms [Callan 99]. An ANN can be trained to learn a concept given a set of training instances. The learning process (i.e. training) of an ANN can be either unsupervised or supervised. Unsupervised learning is based on algorithms which can improve by processing input with unknown output. Supervised learning, on the other hand, requires the training data to contain both the input and the required output for each input pattern. When an ANN produces an output which differs from the required output, the ANN is updated in the purpose of learning to classify the instance correctly. Training data for a supervised ANN can be acquired by having people determine the similarity between documents. In this project we have used supervised learning solely. 1.2 Problem To build a successful language statistics server, the requirements listed above need to be fulfilled. How can the architecture of a modular language statistics server which provides reliable local and reference statistics be defined? We aim to build an efficient system for determining the similarity between documents for use in the statistics server. Examples of such systems have successfully been implemented with the use of containment and resemblance. We wish to see if an ANN system using containment and resemblance, treating texts written in Swedish, can be improved by using tokens, lemmas and part-of-speech tags as input. Can an ANN system, with the use of containment and resemblance, based on tokens, lemmas, and PoS tags outperform a similar system solely based on tokens? 3 CHAPTER 1. INTRODUCTION 1.3 Aim The aim of this Master s project is: 1. to state how an architecture and a design of a modular language statistics server can be defined. 2. to implement an effective text similarity system to be used in conjunction with the language statistics server in the purpose of detecting similar texts. 3. to examine whether identification of similar texts using resemblance and containment can be improved by using tokens, lemmas and PoS tags instead of solely using tokens. 1.4 Thesis outline The rest of this thesis is outlined as follows: Chapter 2 describes the overall process of this project. Chapter 3 presents the theoretical foundation that the text similarity system is based on. Chapter 4 describes the text similarity system that was implemented in this project. Chapter 5 examines wheter a text similarity system based on tokens, lemmas and PoS tags can outperform a similar system based on tokens solely. Chapter 6 is a technical report of the architecture and the design of the language statistics server that was implemented in this project. Finally, Chapter 7 includes a summary of the achieved results and conclusions of this project and suggests future work. 4 Chapter 2 Method survey This project has involved three distinct tasks: developing a text similarity system able to identify similarities between documents. examining whether the text similarity system could be improved by using linguistic analysis of the texts. defining an architecture and a design of a language statistics server. 2.1 Identification of similarities in texts Within this project, a system named TextSim was developed in Java for finding similarities in text documents. TextSim is based on resemblance and containment combined with a supervised machine learning algorithm. TextSim can be trained on any combination of tokens, lemmas, and PoS tags. Two instances of the TextSim system were trained: one based on tokens and one based on tokens, lemmas, and PoS tags. Evaluation data collected from two assessors were used for comparison between the systems. This will be discussed in detail in Chapter Developing a language statistics server To define an architecture and a design of a language statistics server a prototype which fulfills the requirements outlined in Chapter 1 was developed in Java and SQL. The prototype is named StatServ and realizes the suggested architecture. TextSim was integrated into StatServ to classify documents as similar or non-similar to the existing documents in the database. To reach a sound architecture, object oriented principles were used throughout the development process. Design and architectural patterns were used and are documented in this thesis (see Chapter 6). The development process was iterative where each iteration involved designing, implementation and testing. Chapter 6 is a technical report of the architecture and the design of StatServ. The final prototype was tested with multiple concurrent clients requesting reference statistics and local statistics from document sets which contained more than documents. A module called LemmaTagModule was developed in the purpose of testing the modularity of the server. 5 Chapter 3 Theoretical background This chapter describes the theory behind the text identification system TextSim which is used by StatServ (the language statistics server) to identify similar texts. 3.1 Machine learning Mitchell (1997) states: Machine Learning is the study of computer algorithms that improve automatically through experience. Given a set of training instances, an ML model can be trained to grasp a particular concept. Learning can be divided into two categories: supervised learning and unsupervised learning. Supervised learning requires training examples. An example consists of both an instance and the desired output target. Instances will also be referred to as input patterns. During training, the model is updated whenever it generates an output that differs from the desired output. When a model has been generated the aim is that the model can produce correct output on unseen input patterns. A well approved category of ML algorithms is the Artificial Neural Network (ANN) algorithms [Callan 99]. Successful applications using ANNs include, among others, face recognition and optical character recognition [Mitchell 97]. ANNs consist of one or many connected layers of computing units called neurons. Each neuron has one or several input nodes and one output node. When input is received from the input nodes, the neuron generates an output signal depending on the input values. The input nodes are normally attached with a corresponding weight which represents the strength of the connection between the input node and the neuron. In order to compare ML models, trained to learn the same concept, it is essential to be able to evaluate the models. Many sound strategies for evaluating ML models have been obtained from the field of statistics [Witten 00] The perceptron The perceptron, invented by Frank Rosenblatt (1958), is a single layer ANN which can learn linearly separable functions. A perceptron consists of a single neuron connected with an arbitrary number of real value input nodes. Figure 3.1 illustrates a perceptron. The input nodes are represented by a vector x. A weight is attached to each node to represent the strength of the connection between the input node and the neuron. The weights are also represented by a vector w of real number weights. A linear combination of 7 CHAPTER 3. THEORETICAL BACKGROUND x 1 w 1 θ x 2 w 2 s sign(s) Y x n w n Figure 3.1. A perceptron using the sign function to determine its output. x = (x 1, x 2,..., x n) are inputs while w = (w 1, w 2,..., w n) are the attached weights. The sum s = P n i=1 x i w i θ, is passed as an argument to the activation function sign : R {1, 1}. the n inputs and the corresponding weights produces an activation level given by n x i w i = x w. i=1 For each input pattern presented for the perceptron, an activation function determines the output of the perceptron based on the activation level and a threshold value θ. We use the sign function as the activation function: { 1 if x 0, sign(x) = 1 otherwise. Given the threshold value θ and the input weights w, the output Y of the perceptron for input x is given by Y ( x) = sign( x w θ). Further, if the desired output target Y t is known, the error e is given by e( x) = Y t ( x) Y ( x). The knowledge of the perceptron is stored in its weights and its threshold value. For each training example, the perceptron uses the perceptron learn
Related Search
Similar documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks