Parsing by Example. Diplomarbeit der Philosophisch-naturwissenschaftlichen Fakultät der Universität Bern. vorgelegt von. - PDF

Parsing by Example Diplomarbeit der Philosophisch-naturwissenschaftlichen Fakultät der Universität Bern vorgelegt von Markus Kobel März 2005 Leiter der Arbeit: Prof. Dr. Oscar Nierstrasz Prof. Dr. Horst

Please download to get full document.

View again

of 77
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.


Publish on:

Views: 16 | Pages: 77

Extension: PDF | Download: 0

Parsing by Example Diplomarbeit der Philosophisch-naturwissenschaftlichen Fakultät der Universität Bern vorgelegt von Markus Kobel März 2005 Leiter der Arbeit: Prof. Dr. Oscar Nierstrasz Prof. Dr. Horst Bunke Tudor Gîrba Prof. Dr. Michele Lanza Institut für Informatik und angewandte Mathematik The address of the author: Markus Kobel Rosenweg 4 CH-3303 Jegenstorf kobel/ Abstract We live in a world where we are surrounded with information technology. The software systems around us are countless. All of those systems have been written once and must be maintained today. While a system evolves it becomes difficult to maintain. We use reengineering tools today to simplify maintenance tasks. With the support of such tools we can change the form of software systems in a way that makes them easier to analyze. Before we can use any reengineering tool with a software system we must reverse engineer that system. To reverse engineer a software system means that we need to build a model from the system. This model represents our system in a more abstract way than the source code itself does. The way from the source code to the model is often a problem. If a reengineering tool supports a specific model the maintainers of that tool must provide a parser for every programming language they want to support. Such parsers translate source code written in a particular language into a model. There are so many languages used in systems today that it is not possible to support all of them. Additionally, the languages themselves evolve and so we need parsers for every version and every dialect of a programming language. There are a number of approaches to solve that problem (for example fuzzy parsing). Most of these approaches are not flexible enough for today s needs: We cannot adapt them to another programming language or if we can we need a lot of knowledge about the language and about the whole parsing technique. Depending on the technique that we use we must write a parser or at least a grammar as a basis for a parser generator. In most of the cases this is a difficult and time-consuming task. Our idea is to build an application that generates parsers based on mapping examples. A mapping example is a section in the source code to which we assign an element in our target model. Based on these examples, our application builds grammars and generates a parser. If the parser fails to parse some code our application asks the user to provide more examples. This approach is flexible enough to work with a software system written in an arbitrary programming language. The user does not need to have knowledge about parsing. However, he should be able to recognize the elements in the source code that he wants to map on the elements in the target model. We prove the flexibility of that approach with our reference implementation called CodeSnooper. This application works with any input. As target model we take the FAMIX model that is used by the MOOSE reengineering environment. iii Acknowledgements First of all, I want to thank Prof. Dr. Oscar Nierstrasz for giving me the possibility to do this work in his group. The subject of this work was in his mind for a long time now. I always felt that he believes in my work during the last year. Thanks go also to Tudor Gîrba who supervised this work. With his suggestions, CodeSnooper became a usable tool. I also want to thank Prof. Dr. Horst Bunke for his uncomplicated supervision. I often forgot to inform him about the state of this work - even so, he was never angry. Special thanks go to Prof. Dr. Michele Lanza. Even though he went away from Berne I always knew that I can reach him if it was needed. I want to thank you, Michele, for all you have done for me - first as a teacher, then as my boss and supervisor and in the end as a friend. There are a lot of students i got to know during the last few years. If I want to name all of them this list becomes endless and I would forget someone for sure - so, I just thank all of you for providing a pleasant atmosphere during my studies. Special thanks go to the students that shared many days in the lab with me: Thomas Bühler, Adrian Lienhard and Michael Meer. Another group of (former) students I want to thank are the legendary bitdefenders composed of Michael Locher, Mauricio Seeberger, Christoph Hofer and Stefan Leuenberger. Thank you for the endless discussions and coffee klatsch... ;) Many thanks go to my whole family. A special place in my life have my parents - they supported my will in the last years although it wasn t easy. My sister and my brother do have a special place, too. Thank you for sharing your youth with me. I know that my mind often went in another direction than yours. There is one last person that I want to mention by name. It s my godfather Walter Lüthi - thank you for your support and your belief in my work. I know you often had to wait a long time... Last but not least I want to thank all of my friends that I got to know before and beside my studies. Unfortunately not all friendships survived my studies - but the otherones are stronger than before. If you find a solution and become attached to it, the solution may become your next problem. Markus Kobel March 2005 v Contents Abstract Acknowledgements iii v 1 Introduction Our Approach: Parsing based on Examples Goals of this Work Structure of this Document State of the Art Parsing for Reverse Engineering Different Approaches Fuzzy Parsing Island Grammars RegReg Generalized Parsing DURA Problems with Different Approaches Parsing by Example Scanner Mapping Examples Example Tree Components of a Grammar Language Constructs Generation of a Grammar Nodes without Target Reduce Actions Whitespace Merge Productions Parser Generation vii viii CONTENTS 3.8 Parse Tree Model Export External Representation of the Parser Definition Validation JBoss j2ee Package First Iteration: Detection of Classes Second Iteration: Detection of Classes and Methods Third Iteration: Detection of Classes, Methods and Attributes Comparison with another Parser Ruby Libraries First Iteration: Detection of Classes Second Iteration: Detection of Classes and Methods Third Iteration: Detection of Classes, Methods and Attributes Comparison with another Parser Conclusions Summary of Problems with Ruby Source Code Discussion General Statements Problems and Possible Solutions Ambiguous Grammars False Positives and False Negatives One Grammar is not Enough Language Constructs in Unexpected Contexts Complex Control Structures Packages in Java Inner Classes Benefits of our Approach Limitations of our Approach Conclusion and Future Work Parsing by Example compared with other Approaches Conclusion Future Work A Tools 61 A.1 CodeSnooper A.1.1 Quick Start A.1.2 Requirements viii CONTENTS ix A.1.3 User Interface A.2 MOOSE B Definitions 69 B.1 Document Type Definition for our Parser Definitions B.2 Validation Environment B.3 Grammar Example ix x CONTENTS x Chapter 1 Introduction A program that is used in a real-world environment must change, or become progressively less useful in that environment. As a program evolves, it becomes more complex, and extra resources are needed to preserve and simplify its structure. [LEHM 85] The field of software reengineering becomes more and more important. But we can only reengineer a system when we understand its structure. In order to understand that structure we have to reverse engineer that system. There are several approaches to reverse engineer a program. Among them are the following approaches: reading existing documentation and source code, running the software and/or generating and analyzing execution traces, interviewing the users and developers, using various tools, analyzing the version history, et cetera [DEME 02]. A tool that supports us in getting an overview of a software system must somehow translate that system into a model. This translation is a challenging point. Someone must write a parser that can translate that software system into the model he wants to support. So, the maintainers of such tools must provide a parser for every programming language they want to support. But it is not only the number of languages that is a problem. A language itself also evolves. A parser that works with a specific version and/or dialect could not work with the next version anymore. A parser does two things while processing its input: 1. Split the input into tokens. 2. Find the hierarchical structure of the input. First we have a lexical analyzer (scanner) that splits the input into tokens (point 1). The syntax analyzer takes these tokens as input and generates a syntax tree (point 2). To simplify the task of writing such analyzers Lesk and Johnson published papers on Lex (a lexical analyzer generator) [LESK 75] and Yacc (yet another compiler-compiler) [JOHN 75]. There are different implementations of Lex and Yacc available today. For example flex 1 and bison 2 from the GNU project 3. Lex source is a table of regular expressions and corresponding program fragments. This table is translated into a program that reads an input stream, copying it to an output stream and splitting the input into strings which match the given expressions. The program fragments are executed in the order in which the corresponding regular expressions occur in the input stream 2 CHAPTER 1. INTRODUCTION Yacc takes as input the grammar of a language. Each production in that grammar matches a specific structure of the language. The user of Yacc specifies an action for each production. Yacc converts that grammar into a program that reads an input stream and whenever a structure is found it executes the action of the corresponding production. Both Lex and Yacc produce C code. There are also similar tools available for other programming languages. When speaking about the input of a lexical analyzer generator we use the term scanner definition. We use a scanner definition together with a grammar as input for a parser generator. It is easier to write a scanner definition and a grammar than a parser. But writing a grammar is still a time-consuming and difficult task: This is a problem if we want to parse a lot of different inputs. There are several approaches known today that address this problem. Often we do not need to fully parse a system to have a valid model. There are several approaches that extract a partial model from the input. One approach is Fuzzy Parsing: the idea is that we have some anchor terminals and based on them we can extract a partial source code model [KLUS 03]. We can also build a parser based on a Island Grammar where we define some productions for islands (things we want to recognize in the source code) and every other production leads to water (things we do not care about) [MOON 01]. We also know about Generalized Parsing: such a parser can handle ambiguities in the provided grammar and follows every possibility to parse the input [VAN 98]. There are also parser generators like DURA that produce parsers with additional features like backtracking [BLAS 01]. No matter which approach we use, we must have a good knowledge about the programming language and about the parsing technique. The problem we are addressing in this thesis is how to provide a useful grammar for a parser generator. Our target is to generate a model based on provided mapping examples. A mapping example is a section of some source code to which we assign an element in our target model. Based on such examples we build grammars and with them we generate parsers. With that approach we can get a model from any input in a short time period and without a deep knowledge about parsing techniques. However, we must be able to recognize the elements we want to detect in the given source code. The following points are the core requirements that we want the reference implementation of our approach to fulfill: Reliability: The application must work with any input without crashing. In the worst case it does nothing at all. If it can only parse a part of the system it must provide a valid model from the successfully parsed part. Exception handling: If the generated parser fails to parse the input we want enough feedback to know why this happened. If it does not know how to parse part of the input it must ask us to provide another example. Incremental parsing: We want to build up models in an iterative and incremental way. As an example for an object-oriented system, we just detect Classes in a first run and then, we add Methods and Attributes to that model in a second run. User interface: We want a user interface where we can specify mapping examples in an intuitive way. 2 1.1. OUR APPROACH: PARSING BASED ON EXAMPLES Our Approach: Parsing based on Examples We need a model of the system we want to analyze. In most cases we do not need to fully parse that system. In a first step it is often enough if we can extract a partial model of the system. For example in an object-oriented system that could be Classes in a first step and Methods and Attributes in a second iteration. We specify only examples of the mapping between our input and elements of a target model. Based on this information we generate a parser that can translate the whole input into a model. When generating the parser we want to generalize the given examples in order that the parser can detect other instances of these target elements as well. If there are sections in the source code that the parser cannot recognize we specify some more examples. We use this additional information together with the initial examples as basis for the next parser. With this approach we extract information from a system written in any language. If we have a system that is written in a language that we do not know, usually we do not have a precise parser that can give us a model for that system. When we analyze such a system and recognize some instances of the elements in our target model we can specify them as examples: We can extract a model from a system without knowing the language. We validate our approach with the support of our reference implementation called CodeSnooper. To work with CodeSnooper the user does not need any knowledge about parsing. CodeSnooper works with any input, generates grammars and, based on them, parsers. It gives feedback if a parser fails to parse any part of the input. With our approach we mix precise parsing with elements from fuzzy parsing and island grammars. We do not just skip input until we find a specific anchor as is done in fuzzy parsing. We only ignore certain tokens in well defined contexts. We can say that the grammar we generate is some kind of island grammar. Since we generate our grammars based on examples we can get more complex grammars in less time compared to writing them by hand. The parser that we get from our compiler-compiler is a LALR(1) parser - that means it is a Look Ahead Left to right parser and produces a Rightmost derivation with a look ahead of 1 input symbol. There is a large set of context-free grammars from which we can generate LALR(1) parsers. LR parsers are commonly used by programming language compilers and therefore used for this work. In Figure 1.1 we see how we arrive from source code to a model: We generate a Node from our input (source code). We specify examples of the mapping between source code and our target model directly on this Node. This Node becomes a tree because we split it up and add subnodes when specifying examples. This tree is explained in Section 3.3. We have two possibilities to get a model from that tree: 1. We export the example tree directly as a model. The model contains exactly the elements that we specified as examples. 2. We generate a grammar from our examples. We give this grammar to the compiler-compiler and generate a parser that we use for parsing the whole source code. The parse tree that we get is built with the same Nodes as our examples. We export this parse tree as a model. The way from the source code to a model is explained in detail in Chapter 3. 3 4 CHAPTER 1. INTRODUCTION Figure 1.1: The way from source code to a model. With such a design we can be sure to get a raw model from any software system. Even if our generated parser is not able to parse a whole system we can use that application as a direct interface between any source code and our target model - at least, we can always translate the given examples into a valid model. 1.2 Goals of this Work Here are some questions that we would like to answer in this work: Is it possible to build up an application that can generate grammars/parsers on the fly based on some mapping examples? What information is needed from the user at least to build a usable grammar? How many examples do we need to build a usable grammar? How much time do we need to make enough examples for building up a usable grammar? Is there information in the source code that we cannot transform into a model with our approach? What does a useful user interface for such a tool look like? 4 1.3. STRUCTURE OF THIS DOCUMENT Structure of this Document This document consists of the following chapters: In Chapter 2 we discuss the parsing techniques that are used in reengineering environments at the moment. We investigate a few approaches and their assets and drawbacks. We take a look at the problems that these approaches cannot solve. In Chapter 3 we present our approach Parsing by example. We discuss every step that is needed to parse a system with our approach. We look at the specification of examples and the inference of grammar rules based on these examples. We also discuss all the components of our grammar. We explain how we build our parse tree and how we can export a model based on such a tree. Finally, we show how we can store a complete parser definition. In Chapter 4 we summarize experiences that we made during the work with our application. We take two examples of source code that we transformed into a model with our approach and compare them with the models that we get while using a conventional method. In Chapter 5 we evaluate our approach. We discuss the problems we had and provide possible solutions. We also look at the limitations of our approach. Chapter 6 contains a summary of our work and experiences we gained during this work. We compare our approach with the other ones. We discuss if we can solve the problems that other approaches cannot solve. We also look at possible future work. Appendix A contains an overview of the application that we implemented during this work. This tool is named CodeSnooper and we used it as a proof of concept for our approach. We look at the requirements that we had and explain the functionality that was needed in order to get a usable tool. In Appendix B we provide the document type definition for our parser definitions, give an overview of our validation environment and include an example of a complete grammar. 5 6 CHAPTER 1. INTRODUCTION 6 Chapter 2 State of the Art 2.1 Parsing for Reverse Engineering Today, there are many software systems that are difficult to maintain because they evolved over time. We use reengineering tools to simplify maintenance tasks. Often, there is a problem to transform these systems into a model that can be used by a reengineering tool. There is a large number of programming languages and dialects. One would need a parser for every single language and its dialects. Because of that almost every reengineering environment provides an external representation of the model it uses in one or more common file formats. For example the MOOSE reengineering environment supports the import from CDIF and from XMI. Source code written in C/C++, Java or Ada can be parsed and outputted in the CDIF format by Sniff+ 1. With such external parsers the maintainers of a tool can extend its support for different languages. The drawback is that they must rely on external tools. The maintainers of reengineering frameworks must provide a parser
Related Search
Similar documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks