AbstractWe present algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. We obtain searching algorithms that run in logarithmic expected time in the size of the text for a wide subclass of regular expressions, and in sublinear expected time for any regular expression. This is the first such algorithm to be found with this complexity.
The abstract is also available as a LaTeX file, a DVI file, or a PostScript file.
Categories and Subject Descriptors: E.1 [Data Structures]; F.2 [Analysis of Algorithms and Problem Complexity]; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.5.4 [Pattern Recognition]: Applications -- text processing
General Terms: Algorithms
Additional Key Words and Phrases: Digital trees, finite automata, regular expressions, text searching
Selected references
- Alberto Apostolico and Wojciech Szpankowski. Self-alignments in words and their applications. Journal of Algorithms, 13(3):446-467, September 1992.
- Ricardo A. Baeza-Yates. Fringe analysis revisited. ACM Computing Surveys, 27(1):111-119, March 1995.
- Don S. Batory. On searching transposed files. ACM Transactions on Database Systems, 4(4):531-544, December 1979.
- Philippe Flajolet and Claude Puech. Partial match retrieval of multidimensional data. Journal of the ACM, 33(2):371-407, April 1986.
- Edward Fredkin. Trie memory. Communications of the ACM, 3(9):490-499, September 1960.
- Gaston H. Gonnet. Unstructured data bases or very efficient text searching. In Proceedings of the Second ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 117-124, Atlanta, Georgia, 21-23 March 1983.
- Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, San Francisco, California, 22-24 January 1990.
- Donald R. Morrison. PATRICIA -- practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534, October 1968.
- Ronald L. Rivest. On the worst-case behavior of string-searching algorithms. SIAM Journal on Computing, 6(4):669-674, December 1977.
- Wojciech Szpankowski. A generalized suffix tree and its (un)expected asymptotic behaviors. SIAM Journal on Computing, 22(6):1176-1198, December 1993.
- K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):419-422, June 1968.