Using Apache Tika for Entity Extraction and OCR


One of the most useful features of Tika is its ability to extract text from your documents. Tika handles documents, images, and many other file types. This lets you build reference data that points back to the original files and, combined with a search engine, lets you search thousands of documents in seconds. OCR converts scanned documents and images to text so that you can find particular mentions within a file. Think of Tika as a way of building a card catalog that lets you find files of interest using keywords or particular phrases.
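
As a rough illustration of that workflow, here is a minimal sketch that sends a file to a running Tika server and searches the extracted text for a phrase. It assumes a server is listening at `http://localhost:9998` (the default tika-server port) and uses the server's `/tika` endpoint. The helper names (`build_text_request`, `extract_text`, `find_mentions`) are made up for this example and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_text_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking the /tika endpoint for plain text."""
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, base_url: str = TIKA_URL) -> str:
    """Send a file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_text_request(f.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


def find_mentions(text: str, phrase: str) -> list[str]:
    """Return the lines of extracted text that mention a phrase, ignoring case."""
    needle = phrase.lower()
    return [line.strip() for line in text.splitlines() if needle in line.lower()]
```

With a server running, something like `find_mentions(extract_text("report.pdf"), "invoice")` would list the lines that mention the phrase (`report.pdf` is a placeholder file name). For real search across thousands of files you would feed the extracted text into a search index rather than scan it line by line.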


Another feature of Tika is metadata extraction. When a file is processed, Tika can read details such as the author's name, the computer where the file was written, and date and timestamp information, where the file format records them. Tika is also language aware: it can determine what language a document was written in, and it can report permissions, title, and descriptive metadata. This is really useful when you are looking for very specific types of content, for example when you want to view only the files created by a certain individual.
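
The sketch below shows one way to pull that metadata from a Tika server and filter on the author. It assumes a server at `http://localhost:9998` and uses the `/meta` endpoint with a JSON response, plus `/language/stream` for language detection. The Dublin Core style keys (`dc:creator`, `dc:title`, `dcterms:created`) are typical of what Tika reports, but which keys actually appear depends on the file type and the Tika version, so treat the field list as an assumption. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"

# Assumption: common metadata keys; real files may use different ones.
FIELDS = {
    "author": "dc:creator",
    "title": "dc:title",
    "created": "dcterms:created",
    "type": "Content-Type",
}


def put_file(path: str, endpoint: str, accept: str, base_url: str = TIKA_URL) -> str:
    """PUT a file to a Tika server endpoint and return the response body."""
    with open(path, "rb") as f:
        request = urllib.request.Request(
            f"{base_url}{endpoint}",
            data=f.read(),
            method="PUT",
            headers={"Accept": accept},
        )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


def fetch_metadata(path: str) -> dict:
    """Return the metadata Tika reports for a file as a dictionary."""
    return json.loads(put_file(path, "/meta", "application/json"))


def detect_language(path: str) -> str:
    """Return the language code Tika detects for a file's content."""
    return put_file(path, "/language/stream", "text/plain").strip()


def summarize_metadata(meta: dict) -> dict:
    """Pick a few fields of interest; missing keys become None."""
    summary = {}
    for label, key in FIELDS.items():
        value = meta.get(key)
        if isinstance(value, list):  # multi-valued fields arrive as lists
            value = ", ".join(str(v) for v in value)
        summary[label] = value
    return summary


def written_by(meta: dict, name: str) -> bool:
    """True if the summarized author field mentions the given name."""
    author = summarize_metadata(meta)["author"] or ""
    return name.lower() in author.lower()
```

Keeping `summarize_metadata` separate from the network call makes it easy to run over metadata you have already stored, which is the usual situation when filtering a large collection by author.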


Tika also comes with a RESTful API, and the product is MIME aware, meaning it can process MIME-encoded information as well as standard text. Tika can also extract text from PowerPoint presentations and from common image file types.
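
To make the MIME awareness concrete, this sketch asks a Tika server which media type it detects for a file, using the `/detect/stream` endpoint. It assumes a server at `http://localhost:9998`. Passing the file name in a `Content-Disposition` header gives the detector an extra hint. The helper names are hypothetical, and `is_image` is only a simple routing check added for illustration.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_detect_request(
    data: bytes, filename: str = "", base_url: str = TIKA_URL
) -> urllib.request.Request:
    """Build a PUT request for the /detect/stream endpoint."""
    headers = {}
    if filename:
        # The file name is an optional hint for the type detector.
        headers["Content-Disposition"] = f'attachment; filename="{filename}"'
    return urllib.request.Request(
        f"{base_url}/detect/stream", data=data, method="PUT", headers=headers
    )


def detect_mime_type(path: str, base_url: str = TIKA_URL) -> str:
    """Return the media type the Tika server detects for a file."""
    with open(path, "rb") as f:
        request = build_detect_request(f.read(), filename=path, base_url=base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8").strip()


def is_image(mime_type: str) -> bool:
    """True for image media types, which need OCR to yield any text."""
    return mime_type.startswith("image/")
```

Note that text extraction from images relies on OCR, which in Tika's case means Tesseract has to be available to the server. Without it, image files will generally yield metadata but no text.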

Tika can be installed directly, or you can run the Dockerized version, which makes installation simple.
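
If you go the Docker route, a command along the lines of `docker run -d -p 9998:9998 apache/tika` should start the server (check the image's documentation for the tag you want). The container takes a moment to come up, so a small readiness check helps before you send it files. The sketch below polls the server's `/version` endpoint until it answers. The function names, retry count, and delay are arbitrary choices for this example.

```python
import time
import urllib.request

# Assumption: the container publishes Tika's default port on localhost.
TIKA_URL = "http://localhost:9998"


def fetch_version(base_url: str) -> str:
    """Ask the Tika server for its version string."""
    with urllib.request.urlopen(f"{base_url}/version", timeout=5) as response:
        return response.read().decode("utf-8").strip()


def wait_for_tika(base_url: str = TIKA_URL, attempts: int = 10,
                  delay: float = 1.0, fetch=fetch_version) -> str:
    """Poll until the server answers, then return its version string."""
    for attempt in range(attempts):
        try:
            return fetch(base_url)
        except OSError:  # covers refused connections and urllib's URLError
            if attempt < attempts - 1:
                time.sleep(delay)
    raise RuntimeError(f"Tika server at {base_url} did not respond")
```

Calling `wait_for_tika()` once at startup is usually enough. The `fetch` parameter exists only so the retry logic can be exercised without a live server.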


In short, you should consider the Apache Tika project when you need to extract text from documents and files. The project is stable, has improved greatly over time, and is a great addition to any platform that handles large numbers of documents.
