In this article, we’ll learn how to extract plain text strings from a few of the most common file types (PDF, DOCX, XLSX, PPTX) that we can expect to deal with day to day as programmers in an enterprise environment.
We’ll briefly review when to use plain text extraction instead of Optical Character Recognition (OCR), and we’ll discuss some real-world use cases for retrieving plain text. We’ll then cover a few open-source APIs that are well suited to handling plain text extraction on a one-off basis. Finally, we’ll demonstrate a proprietary API that saves time by automatically detecting each file type before extracting its plain text content.