How can I install Apache Tika on Ubuntu 22.04|20.04|18.04? Apache Tika is an open source toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Tika is very useful for search engine indexing, content analysis, translation, and similar tasks.
What is new in Apache Tika 2.2.x
- Add support for OneNote files downloaded from O365
- Improve extraction of embedded files from MSOffice files created by non-Microsoft tools
- Added back ability to ignore load errors in TikaConfig
- Fix logic bug in PipesServer that prevented concatenation of content from attachments
- Fix default logging in tika-app in batch mode
- Fix race condition when starting multiple forked servers on multiple ports
- Add metadata item for whether or not a PDF has a collection/is a Portfolio PDF
- Add detection of JPEG XL, MARC, ICC profiles, NES-ROM file types
- Add optional fetch ranges to FetchEmitTuple to allow range fetching from, e.g., HTTP or S3
In this post, we will discuss the installation of Apache Tika on Ubuntu 22.04|20.04|18.04 LTS.
Apache Tika dependencies
To build and install Apache Tika on Ubuntu 22.04|20.04|18.04 LTS, you need:
- Java Development Kit (JDK)
- Apache Maven
We will install these dependencies before downloading and building Tika on an Ubuntu 22.04|20.04|18.04 Linux system.
Step 1: Install required dependencies
Start by ensuring you’re running an updated Ubuntu Desktop / Server.
sudo apt update
sudo apt -y install wget curl vim unzip
Step 2: Install Java on Ubuntu 22.04|20.04|18.04
Since Tika 1.19, building with Java 11 is supported. You can install Java on Ubuntu using the following command:
sudo apt install -y default-jdk
Confirm the installed version of Java:
$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
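If you want to check the Java version from a script rather than by eye, the sketch below pulls the major version out of the version string. The helper name java_major is made up for this example, and the parsing assumes the usual quoted version string printed by java -version.

```shell
# Hypothetical helper: extract the major version from a Java version string.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;               # legacy scheme: 1.8.0_312 -> 8
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;  # modern scheme: 11.0.13 -> 11
  esac
}

# java -version prints to stderr, so redirect it before parsing.
if command -v java >/dev/null 2>&1; then
  ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  echo "Java major version: $(java_major "$ver")"
else
  echo "java not found on PATH"
fi
```

On the system shown above this should report major version 11.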
Step 3: Install Apache Maven
Install Apache Maven by following our guide, or install the maven package from the Ubuntu repositories.
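Before moving on, it helps to confirm that Maven is actually on your PATH. The following is a minimal sketch; the have helper is a name invented for this example, and the apt hint assumes the maven package from Ubuntu's repositories is recent enough for the Tika build.

```shell
# Hypothetical helper: true if the named command is available on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

if have mvn; then
  # Print the Maven and Java versions that the build will use.
  mvn -version || echo "mvn is installed but failed to run; check your Java setup"
else
  echo "mvn not found; one option: sudo apt install -y maven"
fi
```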
Step 4: Download and Install Apache Tika
Download the latest Apache Tika source release from the Downloads page.
export VER="2.2.1"
wget https://archive.apache.org/dist/tika/${VER}/tika-${VER}-src.zip
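It is good practice to verify the download before building. Apache releases normally publish a .sha512 file next to each artifact; the sketch below assumes one exists for the source zip and that you have downloaded it too. The helper name verify_sha512 is invented for this example, and the format of the digest file varies between releases, so compare the values by eye if the check fails unexpectedly.

```shell
# Hypothetical helper: succeed only if FILE's SHA-512 digest equals EXPECTED.
# EXPECTED may be a bare hex digest or a "digest  filename" line.
verify_sha512() {
  actual=$(sha512sum "$1" | awk '{print $1}')
  expected=$(printf '%s\n' "$2" | awk 'NR==1 {print tolower($1)}')
  [ -n "$expected" ] && [ "$actual" = "$expected" ]
}

# Example use, after downloading the zip and its published digest file
# (the .sha512 file name is an assumption):
#   verify_sha512 "tika-${VER}-src.zip" "$(cat "tika-${VER}-src.zip.sha512")" \
#     && echo "checksum OK" || echo "checksum MISMATCH"
```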
Unzip the downloaded file.
unzip tika-${VER}-src.zip
Change to the new folder and run mvn install:
cd tika-${VER}
mvn install
Wait for the build to finish, then test Tika from within its base directory.
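One way to test the build is to run the tika-app jar it produces. The sketch below assumes the usual Maven layout, where the runnable jar ends up under the target/ directory of the tika-app module; the helper name find_tika_app_jar is invented for this example, and the sample input files are placeholders.

```shell
# Hypothetical helper: print the first tika-app jar found under a build tree,
# skipping the sources, javadoc, and tests jars.
find_tika_app_jar() {
  find "$1" -type f -path '*/target/tika-app-*.jar' \
    ! -name '*-sources.jar' ! -name '*-javadoc.jar' ! -name '*-tests.jar' \
    | head -n 1
}

# Example use from the Tika base directory:
#   jar=$(find_tika_app_jar .)
#   java -jar "$jar" --help                    # list the available options
#   java -jar "$jar" --detect /etc/os-release  # print the detected media type
#   java -jar "$jar" --text README.md          # extract plain text
```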
Reference: http://tika.apache.org/2.2.1/gettingstarted.html