Apache Tika is a toolkit for detecting the type of a document and extracting metadata and structured text from it, covering formats such as PDF, Microsoft Word, and HTML. Parsing is done through the Parser interface: a parser reads the document stream, reports the structured text as XHTML SAX events to a ContentHandler supplied by the caller, and records the document's properties in a Metadata object. Built-in parsers cover many file formats. Tika started as an Apache Lucene subproject, but type detection is done by its own detectors, using cues such as magic-byte signatures and file name extensions.
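The parser-to-handler flow described above can be illustrated with a minimal sketch that uses only the JDK's SAX classes. It is not Tika code: `SaxParsingSketch`, `parsePlainText`, and `TextCollector` are hypothetical names, the plain `Map` stands in for Tika's Metadata object, and the toy parser only handles plain text. It is loosely modeled on the shape of Tika's `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)` call, assuming a caller that wants only the text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class SaxParsingSketch {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    /**
     * Hypothetical toy parser (not part of Tika): treats every non-empty
     * input line as one paragraph, emits XHTML SAX events to the handler,
     * and records properties in the metadata map.
     */
    static void parsePlainText(InputStream stream, ContentHandler handler,
                               Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        AttributesImpl noAttributes = new AttributesImpl();

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));

        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttributes);
        handler.startElement(XHTML, "body", "body", noAttributes);

        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines instead of emitting empty paragraphs
            }
            char[] chars = line.toCharArray();
            handler.startElement(XHTML, "p", "p", noAttributes);
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }

        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Caller-side handler: keeps the text and ignores the markup. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n'); // one output line per paragraph
            }
        }

        String text() {
            return text.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] document = "Hello\n\nWorld\n".getBytes(StandardCharsets.UTF_8);
        TextCollector handler = new TextCollector();
        Map<String, String> metadata = new LinkedHashMap<>();

        parsePlainText(new ByteArrayInputStream(document), handler, metadata);

        System.out.print(handler.text());
        System.out.println(metadata);
    }
}
```

Because the parser only talks to the `ContentHandler` interface, the caller decides what to do with the events: collect plain text as here, serialize the XHTML, or index it, without the parser changing.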