Oleksij Volkov, Simona Ramanauskaitė Šiauliai university, Faculty of Technology. Introduction. Word Search in SQL-Based Relational Databases - PDF

Description
ISSN JAUNŲJŲ MOKSLININKŲ DARBAI. Nr. 2 (40) Research of Word Search Algorithms based on relational database Oleksij Volkov, Simona Ramanauskaitė Šiauliai university, Faculty of Technology

Please download to get full document.

View again

of 6
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information
Category:

Crosswords

Publish on:

Views: 16 | Pages: 6

Extension: PDF | Download: 0

Share
Transcript
ISSN JAUNŲJŲ MOKSLININKŲ DARBAI. Nr. 2 (40) Research of Word Search Algorithms based on relational database Oleksij Volkov, Simona Ramanauskaitė Šiauliai university, Faculty of Technology Introduction Word games are a simple and fun way to spend time. It helps to improve a person s vocabulary, expand erudition, train memory and intelligence, and develop logic and associative thinking. There exists different types of games which require different game logic, and different -finding algorithms. The optimization of search algorithms is required to design a fully functioning and efficient system. The aim of this work is to research the efficiency of different kinds of search algorithms, using an SQLbased relational database. Results of this research could provide valuable information for the design of an efficient helping system for solving games. String Search Algorithms. To find a suitable -for- game requires review of, and to find one. This requires certain resources to compare all s from the to the search pattern. There are different algorithms for string matching problem. As R. Rivest [1] states, to search for a pattern of length m in a text of length n (where n m) the search time is 0(n) in the worst case (for fixed m). Moreover, in the worst case, at least n - m + 1 character must be inspected. However the complexity of string search can vary depending on the algorithm used. The most known unoptimized string search algorithm is naive or brute force algorithm. It tries to match any substring of length m in the text with the pattern [2], therefore n-m+1different combination of one must be checked to find all pattern matches. This algorithm matches all pattern letters for all combinations. Meanwhile Knuth, Morris, and Pratt s algorithm [3] optimizes it and does not examine all the patterns. This algorithm comparing two strings compares it until the first unmatched letters; therefore, the complexity of this algorithm is a little bit smaller and the search time is more dependable on the search pattern and comparable string. To reduce the complexity of the search algorithm Boyer and Moore [4] proposed to search the pattern from right to left. If no mismatch occurs, then the pattern has been found. Otherwise, the pattern matching is moved to the left by an amount of letters which is calculated using heuristic algorithm (adjusting to the difference of the matching pattern). The complexity research of these string search algorithms depend on the length of the pattern was carried out by Ricardo A. Baeza-Yates [5] (see. Fig. 1). It showed the decrease of the Boyer and Moore algorithm, compared to the naive and Knuth, Morris, Pratt algorithm. However all these classic algorithms were proposed in 1977; meanwhile, new information technologies allow different solutions to the problem. Word Search in SQL-Based Relational Databases Relational database model is an organization of data into collections of two-dimensional tables called relations [6]. This type of database management systems at present is the most common in different kinds of information systems [7, 8]. The popularity of relational database management systems is increased by support of SQL, which allowed clear syntax to get required data from relational databases. The Structured Query Language (SQL) is a standardized language used to retrieve and update data stored in relational tables (or databases) [9]. SQL has different type of queries and parameters. However, for string search algorithms the LIKE clause is usually used. It has two wildcards: the percent sign (%) represents zero, one, or multiple characters; the underscore ( _ ) represents a single number or character [10]. The symbols can be used in combinations to compose a desired pattern of search string. Fig. 1. Comparison of complexity of string search algorithms [5] Possible Database Designs for Word Search Based on Letter Patterns The most important element of search systems is the design of a system database: the database has to store the which usually stores a large number of data; a search system usually searches for matching s according to a known pattern; therefore, many connections to the database can be used. 138 technologijos mokslai Informatika The simplest design of a search database is represented in Fig. 2. Fig. 2. Database design No. 1 This architecture is very simple and allows having different dictionaries in the same system. Using this kind of database design, two types of SQL queries can be used to get matching s from the database: To select all matching s from the database according to provided pattern PTRN: SELECT. FROM WHERE. LIKE PTRN. To select all matching s from a specific DICT in the database according to provided pattern PTRN: SELECT. FROM WHERE. = DICT AND. LIKE PTRN. The pattern can be composed of any symbols from the alphabet and percent sign (%) as well as underscore ( _ ) wildcards to represent any unimportant parts of the. However this architecture does not allow for a search according to the length of the. Therefore this database design can be improved by adding an additional field to the table (see Fig. 3). Fig. 3. Database design No. 2 number_of_letters This database design improvement allows for two additional queries to specify the search: To select all matching s from the specific DICT in the database according to a provided pattern PTRN where the length of the is exactly N: SELECT. FROM WHERE. = DICT AND. LIKE PTRN AND.number_of_letters = N. To select all matching s from the specific DICT in the database according to a provided pattern PTRN where the length of the is more than N1 and less than N2: SELECT. FROM WHERE. = DICT AND. LIKE PTRN AND (.number_of_letters N1 AND.number_of_letters N2). There can be different combinations of length limitations; however, these two are the mostly used. The SQL query to get all s with exact length can also be realized with first database design. For this purpose, thee underscore ( _ ) wildcard should be used to identify all unknown letters and their position in the. However, to specify the range of world letters would not be possible with the first database design. To design a system with a wide variety of string search algorithms, a scrabble-type search has to be supported. This type of search usually has no pattern, but just a list of possible letters which have to be used to get a. To do this with the previously described database design would be difficult: all possible combinations of known letters have to be generated; a large number of possible letter combinations can be generated for long set of letters; and SQL is not capable of generating all possible combinations from a known set of letters, etc. To solve this problem, a new improvement to the database design has to be made the database should describe all s by splitting them into letters and associating them to the corresponding (see. Fig 4). letter _id number_of_letters _id letter Fig. 4. Database design No. 3 This architecture would allow for finding all s for a scrabble-type search: To select all matching s from a specific DICT in the database according to a provided set of letters {L1, L2,, LN} where at least M letters of the set would be used: SELECT. FROM letter, WHERE._id = letter._id AND (letter.letter = L1 OR letter.letter = L2 OR OR letter.letter = LN ) AND. = DICT group by letter._id HAVING count(letter._ id) M. This database design would provide all the functionality the system needs. However, it can be extended to search for s by pattern using the table letter, instead of the. This requires adding the position of letter in the only (see. Fig. 5). letter _id number_of_letters Fig. 5. Database design No. 4 _id letter position The position of a letter in a allows us to search by specifying which letter in the set and the letter should be placed in the. Therefore, the query for searching a of specific length can be changed to it: To select all matching s from a specific DICT in the database according to a provided set of letters {L1, L2,, LN} the length of the is exactly M: SELECT. FROM letter, WHERE.number_of_letters = M AND. 139 ISSN JAUNŲJŲ MOKSLININKŲ DARBAI. Nr. 2 (40) _id = letter._id AND ((letter.letter = L1 and letter.position=1) OR (letter.letter = L2 AND letter. position=2) OR OR (letter.letter = LN AND letter. position=n)) AND. = DICT group by letter._id HAVING count(letter._id) N. However, it is unknown if this query would be more efficient than in the previous architecture. Research of Relational Database-Based Word Search Algorithms To define the best database architecture from the proposed ones, a research of its efficiency has to be carried out. For this purpose four different database architectures (described earlier in this paper) will be created in the SQLite database management system. To examine the efficiency of these database architectures, experiments with different number of s in the database have to be done. Therefore, experiments will be executed with 10, 100, 1000 and 5000 s in the database. According to the database architecture, all provided SQL queries should be tested using different patterns or sets of letters. The purpose of the research would be to evaluate the query execution time to obtain a suggested list of s according to the chosen database architecture and the number of s in the database and search pattern or set of letters. Testing will be done with s from the English. With each design of the database two types of requests (complex and simple) will be performed; the average query time will be compared with other designs of databases. Research of Database Design No. 1 To test the performance of the first database design, 4 SQL queries were used to get all the s which start with letter a and differ in conditions: 1. SELECT. FROM WHERE. LIKE a% 2. SELECT. FROM WHERE. LIKE a_% 3. SELECT. FROM WHERE. = 1 AND. LIKE a% 4. SELECT. FROM WHERE. = 1 AND. LIKE a_% Results, obtained from the test are presented in 1 table. As the results of this test showed, the execution time is dependable on the number of s in the database. However, the type of SQL query does not have a great influence on speed (as only one exists in the database), but only the number of found elements differs because of the conditions which were used. Research of Database Design No. 2 The second database design is tested as well as the first one; however, a limitation to the size in used: 1. SELECT. FROM WHERE. LIKE a% AND.number_of_letters = 7 2. SELECT. FROM WHERE. LIKE a_% AND (.number_of_letters 3 AND.number_of_letters 8) 3. SELECT. FROM WHERE. = 1 AND. LIKE a% AND.number_of_letters = 7 4. SELECT. FROM WHERE. = 1 AND. LIKE a_% AND (.number_of_letters 3 AND.number_of_ letters 8) The result of the second database designs testing is provided in table 2 where can be noticed similar tendencies. However, comparing results of the first and second research, we can notice the decrees of the query execution time. It shows that length storage can optimize the SQL query execution time for a search when the desired length of the is known. Table 1. Results of testing with the first database design Site of the 10 s s s s Table 2. Results of testing with the second database design Site of the 10 s s s s technologijos mokslai Informatika Research of Database Design No. 3 The third design has more differences in comparison to the first one, as it has an additional table to store letters of the s. Therefore, more complex SQL queries to join multiple tables have to be used; however, similar queries to the first and second design should work either: 1. SELECT. FROM WHERE. = 1 AND. LIKE a% AND.number_of_letters = 7 2. SELECT. FROM WHERE. = 1 AND. LIKE a_% AND (. number_of_letters 3 AND.number_of_letters 8) 3. SELECT. FROM letter, WHERE._id = letter._id AND (letter.letter = a OR letter.letter = b ) AND. = 1 group by letter._id HAVING count(letter._id) 1 4. SELECT. FROM letter, WHERE._id = letter._id AND (letter.letter = a OR letter.letter = b OR letter.letter = r ) AND. = 1 group by letter._id HAVING count(letter._id) 3 As results of this test show, (see 3 table) the execution time of complex queries is bigger comparing to queries which do not use the table letter. Another interesting thing which was noticed by comparing the 3 rd and 4 th queries is the execution time when there are 1000 s in the database the 4 th query is faster when there are up to 1000 s, while the 3 rd query is faster when the number of s in the database is This leads to fact that the speed of these queries is dependent on the in the database. Research of Database Design No. 4 For the experiment with the fourth database design, different types of SQL queries were tested as well: 1. SELECT. FROM WHERE. = 1 AND. LIKE a% AND.number_of_letters = 7 2. SELECT. FROM WHERE. = 1 AND. LIKE a_% AND (.number_of_letters 3 AND.number_of_ letters 8) 3. SELECT. FROM letter, WHERE.number_of_letters = 7 AND._id = letter._id AND ((letter.letter = a and letter.position=1) OR (letter.letter = b AND letter.position=2)) AND. = 1 group by letter._id HAVING count(letter._id) 1 4. SELECT. FROM letter, WHERE.number_of_letters = 7 AND._id = letter._id AND ((letter.letter = a and letter.position=1) OR (letter.letter = b AND letter.position=2) OR (letter.letter = i AND letter.position=3)) AND. = 1 group by letter._id HA- VING count(letter._id) 2 This database design allows us to search for s by a very specific letter pattern as can be seen from the tested queries. Meanwhile, the execution time of these queries (see 4 table) is noticeably bigger. This leads to the fact that this kind of search can be useful in certain situations; however, it should not be used for a simple search as it can decrease the performance of the system. Summary of the Research As can be seen from research, the time to perform simple queries does not depend on the structure of the database, but only on how much data there is inside. The ascent of time to perform more complex queries can be seen with 5000 s, which generally gives more than 40,000 rows in the database. The average performance of any of the requests depending on the design can be seen in the chart below. The second type gives the best performance, because it is possible to more accurately describe the request, but these features are not enough. The fourth type requires too much time to process their full opportunities. The most Table 3. Results of testing with third database design Site of the 10 s s s s Table 4. Results of testing with the fourth database design Site of the 10 s s s s ISSN JAUNŲJŲ MOKSLININKŲ DARBAI. Nr. 2 (40) Fig. 5. Comparison of execution time in proposed database designs balanced result is provided by the third type, because of quite exact queries that do not require too much time to perform. No relationship between query execution time and number of returned results was noticed. The query execution time is more dependent on the number of records which must be examined. Therefore, an effective way to reduce the search execution time is length limitation, which reduces the number of examined records in the database as well as query execution time. Conclusions 1. The string search algorithm can vary in its complexity and require text analysis, while SQL-based rational database allows for a search according to a known pattern of set of letters more easily. 2. To design certain functionality for search algorithms, limitations can be used depending on database structure. Therefore, the design of database schema is very important to ensure all the functionality and performance of the system. 3. All proposed database designs allow for search in the database and the results of tested SQL queries give suitable results (all s which meet the pattern of set of letters are found). Therefore, the main choice of database design should be done according to the query execution time to get a certain set of s from the database. 4. The storage of letters in different tables for all the s in the gives an opportunity to search for s which have a set of letters. However, this database design is space consuming, as all s have more space in the database, and the execution time is bigger comparing to database designs without a separate table for letters storage. Therefore, this database design should be used just for specific problem solving to ensure good system performance. References 1. R. Rivest, 1977, On the worst-case behavior of stringsearching algorithms. SIAM J on Computing, 6: R. Baeza-Yates, 1989, Improved string searching. Software-Practice and Experience, 19(3): D.E. Knuth, J. Morris, and V. Pratt. Fast pattern matching in strings. SIAM J on Computing, 6: R. Boyer and S. Moore, 1977, A fast string searching algorithm. C.ACM, 20: Ricardo A. Baeza-Yates, 1989,String Searching Algorithms Revisited, Proceedings of the Workshop on Algorithms and Data Structures, p.75-96, August Jeffrey D. Ullman, /focs/ch08.pdf 10. Okunis R., Makarevičius M. Populiariausios duomenų bazių valdymo sistemos. 11. Paliulis E., Baronienė R., 2010, Duomenų bazių valdymo sistemų analizė. Jaunųjų mokslininkų darbai. Nr. 3 (28). P An introduction to the SQL procedure, http://www. ats.ucla.edu/stat/sas/library/nesug99/bt082.pdf . 13. SQL LIKE Clause, http://www.tutorialspoint.com/ sql/pdf/sql-like-clause.pdf . Research of Word Search Algorithms based on relational database Oleksij Volkov, Simona Ramanauskaitė Summary Word games are a simple and fun way to spend time. It helps to improve a person s vocabulary, expand erudition, train memory and intelligence, and develop logic and associative thinking. There exist different types of games and different types of games require a different game logic, and different -finding algorithms. The optimization of search algorithms is required to designa fully-functioning and efficient system. The aim of this work is to research the efficiency of different kinds of search algorithms, using SQL-based relational database. In this work, classic string search algorithms are analyzed and a different architecture of relational database structure was proposed to design a functional and efficient search system. An efficiency analysis of proposed database designs was also analyzed to obtain metrics to define which architecture would be the most suitable for th design of search system. Key s: SQL; search; database. 142 technologijos mokslai Informatika Reliacinės duomenų bazės Struktūra paremtų žodžių paieškos algoritmų tyrimas Oleksij Volkov, Simona Ramanauskaitė Santrauka Žaidimai žodžiais yra paprastas ir linksmas laiko praleidimo būdas, kuris padeda asmenims plėsti savo žodyną, gerinti erudiciją, lavinti atmintį ir mąstymą, vystyti loginį mąstymą. Šiuo metu egzistuoja daug skirtingų žaidimų žodžiais, kurie reikalauja ir skirtingo žaidimo algoritmo, tinkamo žodžio radimo algoritmo. Todėl kuriant informacines sistemas, padedančias surasti ieškomą žodį pagal žinomus kriterijus, yra labai svarbu tinkamai parinkti ar optimizuoti žodžių paieškos algoritmą. Šio darbo tikslas ištirti žodžių paieškos algoritmų, naudojančių SQL paremtas reliacines duomenų bazes, našumą. Darbe apžvelgiami klasikiniai teksto paieškos algoritmai ir analizuojamos skirtingos reliacinių duomenų bazių architektūros, leidžiančios aprašyti turimą žodžių žodyną ir atlikti žodžių atranką naudojant skirtingo tipo paiešką. Darbe tiriamas pasiūlytų duomenų bazės schemų našumas, siekiant įvertinti tinkamiausią duomenų bazės architektūrą žodžių paieškos sistemai, kurioje gali būti vykdomi skirtingo tipo žodžių paieškos algoritmai, realizuoti. Prasminiai žodžiai: SQL; žodžių paieška; duomenų bazė. Įteikta
Related Search
Similar documents
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks