Using the Minidx Extract-Text COM component to extract text content from Word, XLS, PDF... files

by minidxer on 2008-01-13 12:40:50

Many people are amazed that search engines like Google and Baidu can "find" files you've placed on servers, such as Word (.doc), Excel (.xls), and PDF files. Quite a few people have also emailed me asking how Minidx File Manager extracts text content from various file formats. Implementing this is fairly complex on Linux, but on Windows it is relatively easy using the IFilter interface that Microsoft provides through the Indexing Service. Minidx supports over 200 file formats by leveraging IFilter. The basic principle is to write a COM component that searches the system for the path of the DLL that implements the interface for each file format, and then...
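
To make the DLL lookup step more concrete, here is a minimal Python sketch of how a filter DLL is commonly located for a file extension. It is not the Minidx component's code. It assumes the usual Windows registration layout under HKEY_CLASSES_ROOT: the extension's `PersistentHandler` key, then the handler's `PersistentAddinsRegistered` entry for the IFilter interface ID, then the filter's `InprocServer32` path. The names `find_filter_dll`, `windows_reader`, and `Reader` are made up for this illustration, and some formats are registered through extra indirection that this sketch does not handle.

```python
from typing import Callable, Optional

# Interface ID of IFilter, used as a subkey name under PersistentAddinsRegistered.
IID_IFILTER = "{89BCB740-6119-101A-BCB7-00DD010655AF}"

# A reader takes a key path under HKEY_CLASSES_ROOT and returns that key's
# default value, or None if the key or value is missing.
Reader = Callable[[str], Optional[str]]


def find_filter_dll(extension: str, read_default: Reader) -> Optional[str]:
    """Follow the assumed registry chain from a file extension to a filter DLL."""
    ext = extension if extension.startswith(".") else "." + extension

    # Step 1: extension -> persistent handler CLSID.
    handler = read_default(ext + "\\PersistentHandler")
    if handler is None:
        return None

    # Step 2: persistent handler -> CLSID of the class implementing IFilter.
    filter_clsid = read_default(
        "CLSID\\" + handler + "\\PersistentAddinsRegistered\\" + IID_IFILTER
    )
    if filter_clsid is None:
        return None

    # Step 3: filter CLSID -> path of the in-process server DLL.
    return read_default("CLSID\\" + filter_clsid + "\\InprocServer32")


def windows_reader(key_path: str) -> Optional[str]:
    """Read a key's default value from HKEY_CLASSES_ROOT (Windows only)."""
    import winreg  # imported lazily so this module still loads elsewhere

    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "")
            return value
    except OSError:
        return None
```

On a Windows machine, `find_filter_dll(".pdf", windows_reader)` would return the registered DLL path, or `None` if no filter is installed for that format. Passing the reader as a parameter keeps the lookup logic separate from the registry access, so it can be exercised with an in-memory dictionary on any platform.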