How do I... Index file attachments?

Out of the box, Deki Wiki indexes the following types of file attachment:

  • Microsoft Word (.doc/.docx)
  • Microsoft PowerPoint (.ppt/.pptx)
  • Microsoft Excel (.xls)
  • Adobe PDF (.pdf)
  • HTML (.xhtml/.html/.htm)
  • OpenOffice Document (.odt)
  • OpenOffice Presentation (.odp)
  • Text files (.pl/.c/.h/.inc/.php/.cs/.txt/.csv/.xml/.xsl/.xslt)

Adding more content filters

Support for other attachment types is easy to add. All you need is an application or script that reads a file from standard input and writes the text contents of that file to standard output.  The output should contain only information that is indexable.  Everything else, such as formatting and metadata, should be omitted.
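
As a rough illustration of that contract, a filter can be a very small shell script. The sketch below is hypothetical: the function name strip_markup and the tag-stripping approach are assumptions for illustration, not something shipped with Deki Wiki. It reads text from standard input, removes anything that looks like a <tag>, drops blank lines, and writes what is left to standard output.

```shell
#!/bin/sh
# Hypothetical content filter: markup on stdin -> plain text on stdout.
# Assumes a text-based format whose markup uses <tag>-style elements.

strip_markup() {
    # First expression: remove every <...> run that sits on a single line.
    # Second expression: delete lines that are empty or whitespace-only.
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d'
}

# Demonstration with inline sample input.
printf '<h1>Title</h1>\n\n<p>Some <b>bold</b> text</p>\n' | strip_markup
```

In a real filter script the last line would be just strip_markup, so that the script consumes whatever it is given on standard input.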

Once you have an application/script to convert the attachment to text, follow these steps:

  1. Locate your mindtouch.deki.startup.xml file.
  2. Locate the <indexer> element.
  3. Add a <filter-path> element.
  4. Add the extension attribute to specify the file extension the filter applies to (if more than one extension applies, create as many <filter-path> elements as needed).
  5. Set the contents of the <filter-path> element to specify the location of your application/script.  If left blank, Deki Wiki will consider the attachment to be already in text format.
  6. Add an optional arguments attribute.  Deki Wiki will pass this string as the command line when invoking your application/script.  Use $1 as a placeholder for the extension of the attachment to be indexed.
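
Before registering a filter this way, it can help to confirm by hand that it really reads standard input and produces text on standard output. The sketch below is only an assumption about how one might check this from a shell; it does not reproduce how Deki Wiki itself invokes filters. The names my_filter and check_filter, and the path in the comment, are hypothetical.

```shell
#!/bin/sh
# Hypothetical stand-in for a real filter; replace with your own command.
# A matching (equally hypothetical) configuration entry could look like:
#   <filter-path extension="rtf">/usr/local/bin/rtf2text.sh</filter-path>
my_filter() {
    tr -d '\r'    # pass text through, dropping carriage returns
}

# check_filter CMD FILE
# Feeds FILE to CMD on standard input and reports whether any text came out.
check_filter() {
    output=$("$1" < "$2") || { echo "failed"; return 1; }
    if [ -n "$output" ]; then
        echo "ok"
    else
        echo "empty"
        return 1
    fi
}

# Demonstration with a temporary sample file.
sample=$(mktemp)
printf 'hello\r\nworld\r\n' > "$sample"
check_filter my_filter "$sample"
rm -f "$sample"
```

A filter that reports "empty" for a document you know contains text would give the indexer nothing to index, so it is worth fixing before touching the configuration.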

Windows-only: using mindtouch.deki.filter

Deki Wiki contains an executable named mindtouch.deki.filter.exe, which uses Microsoft's IFilter interface to convert attachments to plain text.  If you're using Microsoft Indexing Service or Windows Desktop Search, you probably have a set of IFilters already installed.  To enable mindtouch.deki.filter.exe, follow these steps:

  

  1. Locate your mindtouch.deki.startup.xml file.
  2. Locate the <indexer> element.
  3. Add a <filter-path> element.
  4. Add the extension attribute to specify the file extension the filter applies to.  You can set extension="*" if you want to use mindtouch.deki.filter.exe for all file extensions.
  5. Set the arguments attribute to "$1" to configure mindtouch.deki.filter to use the file's extension when choosing an IFilter.  Alternatively, you can use the arguments attribute to force a particular treatment for a file extension.  For example, to treat ".cs" files as text files, set the arguments attribute to "txt".
  6. Set the contents of the <filter-path> element to specify the location of mindtouch.deki.filter.exe (example:  C:\dekiwiki\web\bin\filters\mindtouch.deki.filter.exe).

  

An example of the indexer configuration using mindtouch.deki.filter is below:

<indexer>
  <path.store>C:\DekiWiki\web\bin\cache\luceneindex\$1</path.store>
  <filter-path extension="cs" arguments="txt">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
  <filter-path extension="sql" arguments="txt">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
  <filter-path extension="vb" arguments="txt">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
  <filter-path extension="*" arguments="$1">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
</indexer>      

  

Additional IFilters can be downloaded from ifilter.org.  To browse the list of IFilters that are already installed, you can download IFilter Explorer.

Comments
This page is more up-to-date than the Windows install page. I used mindtouch.deki.filter.exe in my filter definitions. I also had to update the IFilter on my server. Adobe's IFilter version 6 caused problems, but version 8 worked (pdf_ifilter8_64bit_p1_110607.zip package). IFilter Explorer was useful in confirming a successful filter install.
See the forum thread below for all the challenges and success I encountered getting this critical feature working (And thanks again to PeteE and Corey!). http://forums.opengarden.org/showthread.php?p=12470#post12470
Posted 20:14, 27 Mar 2008
You may want to add an example that uses a binary format, such as DOC or PDF, as the extension. One person in the forums added binary types using arguments="txt"; perhaps the example above confused him/her.
Suggested example:
<filter-path extension="pdf" arguments="$1">c:\DekiWiki\web\bin\filters\mindtouch.deki.filter.exe</filter-path>
Posted 23:27, 5 May 2008
Out of the box, I'm also having an issue indexing *.pdf files (Created with Adobe 7/8, I believe).

I built my version of Deki Jay Cooke from deb packages (using Ubuntu Hardy), and everything as far as indexability of attachments seemed to work fine except for PDF files.

At first, I thought the problem was with html2text -- apparently, it wasn't installed when I updated the package list before my install. No harm, no foul, just installed that package. Was able to successfully convert a PDF file to text using the following from the command line:

pdftohtml -stdout -i -noframes file_name > temp.pdf.html
html2text -nobs temp.pdf.html > temp.pdf.html.text

After restarting both Apache and Deki, and reattaching a PDF 7.0 file, I still am not able to search the contents of the file from the index. Don't see anything turned on in logging that would help point me in the right direction.

Here are the contents of the pdf2text script in the $deki_install/bin/filters folder:

#!/bin/sh
TEMP=`mktemp`
dd of=$TEMP 2> /dev/null
pdftohtml -stdout -i -noframes "$TEMP" | html2text -nobs - - | sed '/^[\=]\+/ d' | sed '$d' | sed '1d' | sed '/^$/d'
rm $TEMP

Any ideas/pointers of where I should be looking at, or anyone else having issues getting PDFs to index on deki jc/ubuntu hardy?
Posted 21:19, 11 Jun 2008
Just for those who looked in the wrong direction, like me:
The startup.xml file is not the one located in var/www/dekiwiki/config (that file ends with '.in').
The one you want is:
/etc/dekiwiki/mindtouch.deki.startup.xml
Posted 08:54, 18 Jul 2008