Searching for Text (Part II)
In the first part of this series, we took a look at various methods that help us to index text and to facilitate text searches. The approaches were focused on SQL databases. Provided you don't get nervous about abandoning SQL queries, I can show you another way to build an index and to query it for your favourite phrase. Prepare to meet Lucene!
Collecting and Converting Files
I've already said that organising your data is of prime importance. This starts with a smart directory structure (or database design) and ends with proper document summaries, tagging, and sensible categories. No matter how you store your files, if you start to build an index you need to collect them somehow. A lot of documents are stored in directory trees (on file servers, for example); some servers hold thousands or millions of files. Indexing this much documentation requires a good strategy - e.g., you don't want to keep rebuilding indexes for data that hasn't changed.
When it comes down to indexing you need plain text; this means you have to convert your file content. Since there is no such thing as plain text anymore, we also have to think about the character encoding. Will it be sufficient to use 7-bit US-ASCII? Do we use ISO-8859-1, or even ISO-8859-15 because we'd like to have the Euro sign? It might be a good idea to use UTF-8: a variable-length character encoding for Unicode that 'feels' like 7-bit US-ASCII as long as no special characters are encountered, yet has room for a lot of strange letters. This is why so many people use it. Be aware that if you use a Unicode encoding in one place, you should use it everywhere. It makes life a lot easier.
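To illustrate the variable-length property, here is a minimal C++ snippet; the byte values are simply the standard UTF-8 encoding of the Euro sign (U+20AC), spelled out as escapes so the source file's own encoding doesn't matter.

#include <cstdio>
#include <cstring>

int main()
{
    // Plain ASCII text is unchanged in UTF-8: one byte per character.
    const char *ascii = "x";
    // The Euro sign needs three bytes in UTF-8 (0xE2 0x82 0xAC),
    // while ISO-8859-15 encodes it as a single byte (0xA4).
    const char *euro = "\xE2\x82\xAC";

    printf("'x' occupies %zu byte(s) in UTF-8\n", strlen(ascii));
    printf("the Euro sign occupies %zu byte(s) in UTF-8\n", strlen(euro));
    return 0;
}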
When looking at a pile of a hundred thousand files on a server, you probably don't want all of them; the ones you want either contain text or can be converted into text. A picture might contain text fields (such as JPEG EXIF tags or PNG iTXt/tEXt chunks), but often it is useful to stick to text file formats. How do we choose? Well, the easiest way is to look at the file extension. This isn't very accurate - the extensions ".jpg" and ".jpeg" almost certainly mark JPEG files, but nothing guarantees that a file's content matches its name, and determining a file type by analysing its content is not easy (the GNU tool file does a good job if you choose to go that way). For now we will stick to extensions for the sake of simplicity.
So what do we do now? Well, we follow the UNIX tradition and split the problem into smaller, manageable parts. First we do a little walking and collecting - a filesystem walk to be exact. We need to create a list of interesting files for later indexing. Let's start with the syntax of the config file that lists all interesting extensions.
extension = [ ".doc", ".htm", ".html", ".odp", ".ods", ".odt", ".pdf", ".txt", ".xls" ]; nice = 13; output = "filelist.txt";
The config option extension lists all interesting extensions. nice is the nice level for our program: a filesystem walk creates more than enough load, so we'll let other processes have the CPU first. output tells our program where to write the list of interesting files - in this case, filelist.txt. The syntax of the config file originates from the C/C++ library libconfig, which parses the file and extracts the options for us. A parser like this isn't that difficult to write by hand, but we have more complex tasks ahead. Later in the article, I'll provide a link to the source code for filelist, the helper code files helper.h and helper.cc, the command line options configuration filelist.ggo, and a Makefile to build everything. You will need the Boost filesystem and iostreams libraries as well as the gengetopt skeleton generator from the GNU project. gengetopt and especially the Boost libraries make your life in C/C++ a lot easier.
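The real filelist tool is built on the Boost filesystem library; purely as an illustration of what such a walk does, here is a minimal stand-alone sketch written against the modern std::filesystem API instead, with the extension list and output file name hard-coded for brevity.

#include <filesystem>
#include <fstream>
#include <set>
#include <string>

namespace fs = std::filesystem;

int main()
{
    // Extensions we consider interesting (see the config file above).
    const std::set<std::string> extensions = {
        ".doc", ".htm", ".html", ".odp", ".ods",
        ".odt", ".pdf", ".txt", ".xls"
    };
    std::ofstream filelist("filelist.txt");

    // Recursively walk the tree and record every regular file
    // whose extension is on the list.
    for (const auto &entry : fs::recursive_directory_iterator(".")) {
        if (entry.is_regular_file() &&
            extensions.count(entry.path().extension().string())) {
            filelist << entry.path().string() << '\n';
        }
    }
    return 0;
}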
Lucene and its Ports
The Apache Lucene project is a collection of tools for building software that features search functions. The core component is Lucene itself, a Java-based indexing and search library. The project offers additional code for building web search applications and for dealing with metadata. Lucene has been ported to Perl (PLucene), Python (PyLucene), C (Lucy), and C++ (CLucene), so you can access a Lucene index with your favourite tool chain. The Java code also contains classes that facilitate the import of documents by using stream classes. Creating and maintaining an index is very easy, and the index format can be used across different platforms and accessed by the various ports of Lucene without the need for conversion.
We'll focus on the C++ port of Lucene in order to index our documents.
Indexing with CLucene
CLucene introduces some concepts that should be understood before using the API. Every object that is to be indexed is called a document; in an ideal world, this document consists of pure text to make the work of the indexer easier. CLucene analyses the content and uses different algorithms to extract useful words or tokens from the indexed document. The analysers are classes of their own and contain the algorithms: dividing by whitespace, using stop words, replacing special characters, and other methods.
Every document can be described by a list of fields with content. The fields can have arbitrary names. These names are used to access the content of the field, very similar to associative arrays or hashes.
Field     | Content
----------+-------------------------------
title     | My notes from the conference
author    | R. Pfeiffer
content   | Lots and lots of text, ...
timestamp | 1207958005
type      | UTF-8 Unicode English text
...       |
This means that the document is a container for all kinds of information added by the fields. It is not necessary to put the whole content into the index, but it helps if your aim is to have a full text search. It is also useful to pack additional metadata into the document. The search covers all fields or a selection of them, so that you can tune your search later in order to reduce the number of hits.
The index as a whole resides in a directory and is completely managed by CLucene; there are no user-serviceable parts, and all access is done through CLucene (or one of the other ports). Moving the index around is as easy as copying the directory. You can even copy it between different platforms, and it should work without conversion.
The CLucene library works with Unicode. The strings are typed as wchar_t *, which means they consist of wide characters (wide literals are marked with an L, as in L'x'). Therefore I suggest that you use Unicode when interfacing with CLucene.
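Your converted documents will typically arrive as UTF-8 byte strings, so at some point they have to become wide strings. Here is a minimal sketch using only the standard C library; it assumes a UTF-8 locale is installed and that the locale name below exists on your system, since mbstowcs() interprets bytes according to the current locale.

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a UTF-8 encoded std::string into a std::wstring
// suitable for passing to CLucene.
std::wstring to_wide(const std::string &utf8)
{
    // A UTF-8 string never has more characters than bytes.
    std::vector<wchar_t> buffer(utf8.size() + 1);
    std::size_t written = std::mbstowcs(buffer.data(), utf8.c_str(),
                                        buffer.size());
    if (written == static_cast<std::size_t>(-1)) {
        return L"";  // invalid multibyte sequence
    }
    return std::wstring(buffer.data(), written);
}

int main()
{
    // Select a UTF-8 locale first (assumed to be installed).
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::wstring wide = to_wide("D\xC3\xBCrer");  // "Dürer" in UTF-8
    return wide.empty() ? 1 : 0;
}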
Simple Indexing Strategy
How do we get from our list of interesting files to a properly-filled index? Simple: we need a little strategy.
- Walk through the list of files one by one
- Check if the file has been modified since the last indexing run
- Determine file type (by extension or more elaborate means)
- Convert file to plain text with a given character encoding
- Add file name, content, timestamps, file type, etc. to index
Now we know what our next piece of code should do; a compact sketch of the whole loop follows, and after that I will describe some important aspects of the task.
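This skeleton is only an outline of the strategy above, not the article's actual indexer: the three helpers (is_unchanged(), convert_to_text(), add_to_index()) are hypothetical stubs standing in for the pieces discussed in the following sections.

#include <fstream>
#include <iostream>
#include <string>

// Stub: would consult the timestamp database (see below).
static bool is_unchanged(const std::string &) { return false; }

// Stub: would run the external converter for this file type (see below).
static std::string convert_to_text(const std::string &filename)
{
    return "plain text of " + filename;
}

// Stub: would build a CLucene document and add it to the index (see below).
static void add_to_index(const std::string &filename, const std::string &text)
{
    std::cout << "indexing " << filename << " (" << text.size()
              << " bytes)" << std::endl;
}

int main()
{
    std::ifstream filelist("filelist.txt");
    std::string filename;

    // Walk through the list of interesting files one by one.
    while (std::getline(filelist, filename)) {
        // Skip files that haven't changed since the last indexing run.
        if (is_unchanged(filename)) {
            continue;
        }
        // Convert to plain text and hand the result to the index.
        add_to_index(filename, convert_to_text(filename));
    }
    return 0;
}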
Converting Documents to Plain Text
I've already said that we need text; this means we have to convert PDF, PostScript, and anything that is not text into text. We should also ensure that the end result is suitably Unicode encoded (UTF-8 for example). In order to avoid doing this in our C++ code, we rely on external programs. This isn't very elegant, but it helps to maintain a flexible approach to future data formats that we wish to index. The external helper tools are defined by file extension in a configuration file.
// List of known file extensions and their converters to
// plain text
pdf=(pdftotext -q -eol unix -enc UTF-8 $IN - > $OUT)
ps=(pstotext $IN | iconv -f ISO-8859-1 -t UTF-8 -o $OUT -)
doc=(antiword $IN > $OUT)
html=(html2text -nobs -o $OUT $IN)
htm=(html2text -nobs -o $OUT $IN)
odp=(ooo_as_text $IN > $OUT)
ods=(ooo_as_text $IN > $OUT)
odt=(ooo_as_text $IN > $OUT)
php=(html2text -nobs -o $OUT $IN)
rtf=(unrtf --nopict --text $IN > $OUT)
txt=(cat $IN > $OUT)
xls=(py_xls2txt $IN > $OUT)
xml=(cat $IN > $OUT)
$IN is the file to be converted. $OUT is the output file, which will be a temporary file. The extension is to the left of the equals sign; the command inside the parentheses will be executed by the indexer prior to feeding the content to the CLucene index. An alternative would be to use classes that understand file types, read them, and convert them; the Strigi project has code for this, called JStreams, which provides a standardised interface for accessing the contents of different file types. The approach with external tools is a bit more generic. Note that all OpenOffice document formats can be converted with a single tool (ooo_as_text is not yet released, but is part of the OOoPy project; kindly ask the author of OOoPy for a copy if it isn't contained in the downloads).
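How the indexer runs these commands is an implementation detail, but the mechanics are simple enough to sketch: substitute $IN and $OUT in the command template, then hand the result to the shell. The substitute() helper below is a hypothetical stand-in for whatever the real code does, and production code should of course quote or escape the file names before passing them to a shell.

#include <cstdlib>
#include <string>

// Replace every occurrence of a placeholder such as "$IN" in the
// command template with the given value (hypothetical helper).
static std::string substitute(std::string command,
                              const std::string &placeholder,
                              const std::string &value)
{
    for (std::string::size_type pos = command.find(placeholder);
         pos != std::string::npos;
         pos = command.find(placeholder, pos + value.size())) {
        command.replace(pos, placeholder.size(), value);
    }
    return command;
}

// Run a converter template like "pdftotext -q -enc UTF-8 $IN - > $OUT".
// Returns true if the external program reported success.
bool run_converter(const std::string &tmpl,
                   const std::string &in, const std::string &out)
{
    std::string command = substitute(substitute(tmpl, "$IN", in),
                                     "$OUT", out);
    return std::system(command.c_str()) == 0;
}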
The configuration file is parsed by a parser generated with Boost's Spirit library; the whole parser is defined using templates. struct filter_grammar in helper.cc contains the full rules. Once you understand how the templates work, the Spirit library is a convenient way to build your own parsers.
Maintaining a Database of Timestamps
When you consider the documents stored on a file server, you will realise that most of them don't change very often. Some may change more frequently while someone is working on them, but once you have a kind of "library", most of the documents will stay the same. So it's a good idea to keep track of the modification times of the files: this lets us decide whether a document needs reindexing based on its last modification timestamp, which saves a lot of I/O when indexing a large collection of documents.
The indexer keeps track of the timestamps in a SQLite database. The table could also be replaced by a simple hash, but I originally wanted to do more in SQL and then never did. indexer.cc has the creation statement.
CREATE TABLE IF NOT EXISTS fileaccess (
    filename TEXT PRIMARY KEY,
    mtime INT(8)
)
SQL is overkill for that, but that's why we use SQLite. We use a transaction during the whole indexing process. The changes are not committed unless all documents are indexed without error.
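A sketch of the timestamp check, using the SQLite3 C API against the table shown above (error handling is trimmed for brevity):

#include <sqlite3.h>
#include <string>

// Return true if the stored mtime for this file matches the current
// one, i.e. the file has not changed since the last indexing run.
bool is_unchanged(sqlite3 *db, const std::string &filename,
                  long long current_mtime)
{
    sqlite3_stmt *stmt = 0;
    bool unchanged = false;

    sqlite3_prepare_v2(db,
        "SELECT mtime FROM fileaccess WHERE filename = ?",
        -1, &stmt, 0);
    sqlite3_bind_text(stmt, 1, filename.c_str(), -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        unchanged = (sqlite3_column_int64(stmt, 0) == current_mtime);
    }
    sqlite3_finalize(stmt);
    return unchanged;
}

After a file has been indexed, something like INSERT OR REPLACE INTO fileaccess VALUES (?, ?) can record the new timestamp inside the transaction mentioned above.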
Creating and Writing to a CLucene Index
Before you can do anything with an index you have to open it. This is very similar to other resources such as files or sockets. However, before you can open or create it, you have to select an analyser object. This analyser defines how you wish to process the data fed to the index. Since we don't know exactly what we are indexing, we'll use a whitespace analyser. Then we open the index.
// This is the analyser we use.
WhitespaceAnalyzer analyser;
// Initialise the CLucene index writer.
lucene::index::IndexWriter index_repository( args_info.index_arg, &analyser,
                                             new_index, true );
args_info.index_arg is a string containing the directory where the index will live. &analyser is our analyser and new_index is a boolean flag that indicates whether we open an existing index or create a new one. The last argument is described with closeDir in CLucene's Doxygen documentation.
The index can now be filled with document objects. A CLucene document is simply a container for fields describing something.
// Fields we want to put into the index
lucene::document::Field *field_filename;
lucene::document::Field *field_file_content;
lucene::document::Field *field_mtime;
lucene::document::Field *field_type;
...
// Create Lucene document for adding to index directory.
file_document = new lucene::document::Document;
// Add fields to document object.
if ( file_has_content ) {
    file_document->add( *field_file_content );
}
file_document->add( *field_filename );
file_document->add( *field_mtime );
file_document->add( *field_type );
index_repository.addDocument( file_document, &analyser );
...
You can add as many fields as you like. Adding metadata is a good idea since you probably don't want to search for content all of the time. The addDocument() method adds the document to the index repository.
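The code above only declares the field pointers; creating a field looks roughly like this. The factory names below follow the classic Lucene API that CLucene mirrors (Field::Keyword for stored-but-untokenised values, Field::Text for tokenised full text), but the exact names and signatures vary between CLucene releases, so treat this as a sketch and check the headers of your version. wide_filename and wide_content are assumed to be wide strings prepared earlier.

// Hedged sketch: filling two of the fields declared above.
field_filename     = lucene::document::Field::Keyword( L"filename",
                                                       wide_filename );
field_file_content = lucene::document::Field::Text( L"content",
                                                    wide_content );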
Since CLucene maintains the index repository, it's also a good idea to call the optimise method if you change a lot of data.
// Optimising the index should be done after there were changes to the index.
index_repository.optimize();
// Close index.
index_repository.close();
That's the short introduction to CLucene, and these are the basics you need to get started. The library can do many more things.
Reading from a CLucene Index
We won't read from the index (yet), but reading is as easy as writing. You open the index repository, send search queries and retrieve documents. A fragment of code performing a search and retrieving all hits looks like this.
using namespace lucene::index;
using namespace lucene::analysis;
using namespace lucene::util;
using namespace lucene::store;
using namespace lucene::document;
using namespace lucene::search;
using namespace lucene::queryParser;

wstring search_string = L"Where is it?";
lucene::index::IndexReader *index_reader;
lucene::search::IndexSearcher *index_searcher;
Query *index_query;
Hits *index_hits;
WhitespaceAnalyzer analyser;

index_reader = IndexReader::open( args_info.index_arg );
index_searcher = new IndexSearcher( index_reader );
index_query = QueryParser::parse( search_string.c_str(), L"content",
                                  &analyser );
index_hits = index_searcher->search( index_query );
if ( index_hits->length() > 0 ) {
    for ( long i = 0; i < index_hits->length(); i++ ) {
        Document &doc = index_hits->doc(i);
        wcout << "FOUND: " << doc.get(L"filename") << endl;
    }
}
delete index_hits;
delete index_query;
index_reader->close();
delete index_searcher;
It is important to use the same analyser that the indexing process used - in our case, the WhitespaceAnalyzer again. IndexReader::open() opens the index, QueryParser::parse() turns the search string into a Query object, and the IndexSearcher's search() method executes the query, returning the Hits from which the matching Document objects can be retrieved. As you can see, all strings are wide strings, so using Unicode really is important.
If you want to debug the CLucene indices you create, try Luke, the Lucene Index Toolbox. It is a Java tool that can display the contents of an index: you can browse the documents, look at the fields, and perform search queries.
The Code
Since this article is already much longer than anticipated, I'll just provide a link to the complete tar archive of all the code I've shown above. It also contains a Makefile to facilitate compiling. I used GCC/G++ 4.1.2 for development (and I'd like to see what Intel's compiler says about my code). If you use a Debian system, you will need the following packages:
- gengetopt
- libboost-filesystem-dev
- libboost-iostreams-dev
- libboost-regex-dev
I compiled SQLite, libconfig++, and CLucene from source, because I wasn't happy with the packages in Debian Etch. The newer SQLite version is especially interesting, since it provides a new API for some functions (marked with ..._v2()). If you compile against the Boost library as well, you have to add the path to its include files (which is /usr/local/include/boost-1_35 for the current version if installed from source). Since the Boost libraries consist mainly of templates, the compile process is fairly short despite the size of Boost's distribution.
Test Run with a little "Benchmark"
And now for the final questions: Why all the fuss? How fast is it? Do we need to care which port we use? I can't answer all of these questions; all I can do is run the indexer over a list of files (this is not a benchmark and has no statistical significance). The directory looks like this:
rpfeiffer@miranda:/nfs/Bibliothek$ du -h --max-depth=1
1.3M    ./Lyrics
703M    ./Security
0       ./Biometrie
16M     ./Sysadmin
172K    ./Misc
403M    ./Teaching
12M     ./Programming
57M     ./Hardware
3.9M    ./Reports
7.3M    ./Networks
92K     ./Chaos
32K     ./Gfx
12K     ./UTF-8
23M     ./VoIP
1.8M    ./Science
1.2M    ./Manuals
1.2G    .
rpfeiffer@miranda:/nfs/Bibliothek$
The documents reside on our file server in the office and are accessed via NFSv3 over Gigabit Ethernet. The machine running the indexer is a Core2 Duo at 2.13 GHz with 2 GB RAM. filelist counts 539 interesting documents (using the extensions I listed earlier). Let's run the indexer and create a new index. Note that the directory was partly cached due to my du command, and that the 539 documents account for only part of the 1.2 GB, because we only look for specific file extensions.
rpfeiffer@miranda:~/code$ time ./indexer -c ./indexer.cfg -i /var/tmp/i -n 1 -l ./filelist.txt

real    1m48.767s
user    1m12.337s
sys     0m17.849s
rpfeiffer@miranda:~/code$
The index looks like this:
rpfeiffer@miranda:/var/tmp/i$ ls -lh
total 5.2M
-rwxr-xr-x 1 rpfeiffer rpfeiffer    4 2008-04-18 23:56 deletable
-rwxr-xr-x 1 rpfeiffer rpfeiffer 5.2M 2008-04-18 23:56 _gk.cfs
-rwxr-xr-x 1 rpfeiffer rpfeiffer   28 2008-04-18 23:56 segments
rpfeiffer@miranda:/var/tmp/i$
So, the indexer did something and stored something to disk. An inspection with Luke shows familiar documents and content.
A Word about Alpha Code and Bugs
Please be aware that the code shown in this article is of alpha quality. The source contains some dead code and still needs improvement (especially the execution of the external helper binaries). So far it works, and it doesn't segfault that often anymore - but that's about it. It is built on solid and stable libraries, but it isn't meant for production use (yet). It's also a bit messy and should be cleaned up, because it took me a while to understand the libraries I used. If you have suggestions, just send patches - that's what the GPL is for. If you have no code but a lot of good ideas, let's hear them! Preferably in an article for one of our next issues.
Useful Resources
- Boost - portable C++ libraries.
- CLucene - a C++ port of Lucene.
- gengetopt - the GNU command line option parser generator.
- libconfig - a C/C++ configuration file library.
- Lucene - Java-based indexing and search technology.
- Luke - the Lucene Index Toolbox.
- SQLite - a lightweight database library.
- Strigi - a desktop search engine that is independent of any particular desktop environment.
René was born in the year of Atari's founding and the release of the game Pong. From his early youth on, he took things apart to see how they worked. He couldn't even pass construction sites without looking for electrical wires that might seem interesting. His interest in computing began when his grandfather bought him a 4-bit microcontroller with 256 bytes of RAM and a 4096-byte operating system, forcing him to learn assembler before any other language.
After finishing school, he went to university to study physics. He then gathered experience with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS, and finally GNU/Linux on a PC in 1997. He has been using Linux ever since and still likes to take things apart and put them together again. Freedom of tinkering brought him close to the Free Software movement, where he puts some effort into the right to understand how things work. He is also involved with civil liberty groups focusing on digital rights.
Since 1999 he has been offering his skills as a freelancer. His main activities include system/network administration, scripting, and consulting. In 2001 he started to give lectures on computer security at the Technikum Wien. Apart from staring into computer monitors, inspecting hardware, and talking to network equipment, he is fond of scuba diving, writing, and photographing with his digital camera. He would like to have a go at storytelling and roleplaying again as soon as he finds some more spare time on his backup devices.