Out of the box, Deki Wiki indexes the following types of file attachments:
- Microsoft Word (.doc/.docx)
- Microsoft PowerPoint (.ppt/.pptx)
- Microsoft Excel (.xls)
- Adobe PDF (.pdf)
- HTML (.xhtml/.html/.htm)
- OpenOffice Document (.odt)
- OpenOffice Presentation (.odp)
- Text files (.pl/.c/.h/.inc/.php/.cs/.txt/.csv/.xml/.xsl/.xslt)
Adding more content filters
Support for other attachment types is easy to add. All you need is an application or script that reads the file from standard input and writes the text contents of the file to standard output. The output should contain only information that is indexable. Everything else, such as formatting and metadata, should be omitted.
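As a rough illustration of that contract, here is a minimal shell sketch of a filter for a hypothetical tag-based format. The function name to_text and the sample input are assumptions made for illustration and are not part of Deki Wiki; the only requirement taken from the text is that the attachment arrives on standard input and plain text leaves on standard output.

```shell
#!/bin/sh
# Illustrative filter sketch (not shipped with Deki Wiki).
# to_text is a hypothetical helper: it reads a document on standard input
# and writes only the indexable text to standard output.
to_text() {
    # 1. Replace anything that looks like a <tag> with a space.
    # 2. Trim leading and trailing whitespace on each line.
    # 3. Drop lines that are now empty.
    sed -e 's/<[^>]*>/ /g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d'
}

# In a real filter script the last line would simply be: to_text
# so that the attachment supplied on standard input is converted.
# Here a sample document is piped in instead to show the behaviour.
printf '<note>\n  <to>Alice</to>\n  <body>Lunch at noon</body>\n</note>\n' | to_text
```

The sample prints two lines, "Alice" and "Lunch at noon". A tag that spans several lines is not handled by this sketch, so a real filter for a markup format would likely need a proper parser.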
Once you have an application/script to convert the attachment to text, follow these steps:
- Locate your mindtouch.deki.startup.xml file.
- Locate the <indexer> element.
- Add a <filter-path> element.
- Add the extension attribute to specify the file extension the filter applies to (if more than one extension applies, create as many <filter-path> elements as needed).
- Set the contents of the <filter-path> element to specify the location of your application/script. If left blank, Deki Wiki will consider the attachment to be already in text format.
- Add an optional arguments attribute. Deki Wiki will pass this string as the command line when invoking your application/script. Use $1 as a placeholder for the extension of the attachment to be indexed.
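To illustrate the arguments attribute, the sketch below assumes a filter registered with arguments="$1", so that the script receives the attachment's extension as its first command-line argument. The function name convert_by_extension and the extensions it handles are hypothetical, and how Deki Wiki quotes or splits the argument string is not covered by the steps above, so treat this as a sketch only.

```shell
#!/bin/sh
# Illustrative sketch (assumption: the filter is registered with arguments="$1",
# so the extension of the attachment arrives as the first argument).
# convert_by_extension is a hypothetical helper, not part of Deki Wiki.
convert_by_extension() {
    case "$1" in
        txt|csv)
            # Already plain text: pass it through unchanged.
            cat
            ;;
        htm|html|xhtml)
            # Crude markup removal: replace tags with spaces.
            sed -e 's/<[^>]*>/ /g'
            ;;
        *)
            # Unknown type: consume the input and emit nothing to index.
            cat > /dev/null
            ;;
    esac
}

# In a real filter script the last line would be: convert_by_extension "$1"
# Sample invocation with the extension given explicitly:
printf 'name,role\nalice,editor\n' | convert_by_extension csv
```

Dispatching on the extension lets a single script serve several <filter-path> elements, each of which points at the same file.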
Windows-only: using mindtouch.deki.filter
Deki Wiki includes an executable named mindtouch.deki.filter.exe, which uses Microsoft's IFilter interface to convert attachments to plain text. If you're using Microsoft Indexing Service or Windows Desktop Search, you probably already have a set of IFilters installed. To enable mindtouch.deki.filter.exe, follow these steps:
- Locate your mindtouch.deki.startup.xml file.
- Locate the <indexer> element.
- Add a <filter-path> element.
- Add the extension attribute to specify the file extension the filter applies to. You can set extension="*" if you want to use mindtouch.deki.filter.exe for all file extensions.
- Set the arguments attribute to "$1" to configure mindtouch.deki.filter to use the file's extension when choosing an IFilter. Alternatively, you can use the arguments attribute to force a particular treatment for a file extension. For example, to treat ".cs" files as text files, set the arguments attribute to "txt".
- Set the contents of the <filter-path> element to specify the location of mindtouch.deki.filter.exe (example: C:\dekiwiki\web\bin\filters\mindtouch.deki.filter.exe).
An example of the indexer configuration using mindtouch.deki.filter is below:
<indexer>
<path.store>C:\DekiWiki\web\bin\cache\luceneindex\$1</path.store>
<filter-path extension="cs" arguments="txt">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
<filter-path extension="sql" arguments="txt">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
<filter-path extension="vb" arguments="txt">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
<filter-path extension="*" arguments="$1">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
</indexer>
Additional IFilters can be downloaded from ifilter.org. To browse the list of already installed IFilters, you can download IFilter Explorer.
See the forum thread below for all the challenges and successes I encountered getting this critical feature working (and thanks again to PeteE and Corey!). http://forums.opengarden.org/showthread.php?p=12470#post12470
Suggested example:
<filter-path extension="pdf" arguments="$1">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
I built my version of Deki Jay Cooke from deb packages (using Ubuntu Hardy), and everything as far as indexing of attachments seemed to work fine, except for PDF files.
At first, I thought the problem was with html2text -- apparently, it wasn't installed when I updated the package list before my install. No harm, no foul, just installed that package. Was able to successfully convert a PDF file to text using the following from the command line:
pdftohtml -stdout -i -noframes file_name > temp.pdf.html
html2text -nobs temp.pdf.html > temp.pdf.html.text
After restarting both Apache and Deki, and reattaching a PDF 7.0 file, I am still not able to search the contents of the file from the index. I don't see anything turned on in logging that would help point me in the right direction.
Here's the contents of the pdf2text script in the $deki_install/bin/filters folder:
#!/bin/sh
TEMP=`mktemp`
dd of=$TEMP 2> /dev/null
pdftohtml -stdout -i -noframes "$TEMP" | html2text -nobs - - | sed '/^[\=]\+/ d' | sed '$d' | sed '1d' | sed '/^$/d'
rm $TEMP
Any ideas or pointers on where I should be looking? Is anyone else having issues getting PDFs to index on Deki JC/Ubuntu Hardy?
The startup.xml file is not located in var/www/dekiwiki/config (the file there ends with '.in').
You should have this one:
/etc/dekiwiki/mindtouch.deki.startup.xml