XPdfLauncher
Rating (5)
Reviews: 1
Category: Utilities & tools

Description

It is common for many data providers, including in particular government agencies or departments, to publish public record data in PDF format. Often these reports are simply ‘line printer’ images exported to PDF, and are therefore made up entirely of text fields. If one browses one of these documents with Adobe Acrobat, one can highlight one page at a time (Ctrl-A, or ‘Select all’), copy, and paste the text content to Notepad or some other text editor. If the document in question is 4000 pages long, this is an unrealistic method of extracting text.

The usual purpose of extracting the text contents of PDFs is to populate spreadsheets or database tables. In many cases the step following the text extraction is to run a custom parser or macro to organize the data in a more suitable manner. XPDFLauncher is written for ‘power users’ who work on this kind of task routinely.

There are a number of commercial programs available for extracting text from large documents. This is a ‘free’ solution for those who can’t justify such expenditures.

Xpdf is a set of open source programs for viewing PDF files, or converting PDF files into various export formats. These command line applications can be downloaded from www.xpdfreader.com. Within this site, the link ‘Download the open source Xpdf tools’ leads to a set of download options for various operating systems and machine architectures. Within this page is a table named ‘Download the Xpdf tools:’. One entry listed within this group is ‘Windows 32/64-bit: download’. Clicking on this link saves a .zip file to the ‘Downloads’ folder or to another folder specified by the user. The .zip file contains a folder… extract this folder to an appropriate directory to make the programs usable.

One of the programs in this collection is named PDFToText.exe. This program extracts text contained in PDF files and writes it as ASCII text to .txt files. XPDFReader.exe is free under most circumstances; further details are provided on the website. PDFToText.exe is executed from the console: it is not installed through an installer and it does not have a graphical user interface. XPDFLauncher makes it more convenient to use by offering a Windows interface for composing the console command line.

The command line consists of the command, optional parameters, the input file name, and the output file name. Depending on where one is storing their files and the directory of the command, these can have long path names. XPDFLauncher functionality allows one to navigate to and ‘remember’ these paths and file names. This is useful first for converting multiple files relatively quickly, particularly if these files are scattered around in multiple folders, or are contained on removable media. It is also useful for converting the same input file to separate outputs with various options, if it is desirable to try various settings until the most useful one is identified.
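As an illustration of trying several settings on one input, the sketch below generates one command line per option, each writing to its own output file. It is not XPDFLauncher's own code. The function name variant_commands, the output naming scheme, and the sample paths are all hypothetical. The flags -layout and -raw are pdftotext options, but check the PDFToText.exe documentation for the version you downloaded.

```python
from pathlib import PureWindowsPath


def variant_commands(exe, input_pdf, options):
    """Return one command line per option, each with a distinct output file.

    An empty string in `options` means "no option on the command line".
    """
    src = PureWindowsPath(input_pdf)
    commands = []
    for opt in options:
        # Name the output after the option, e.g. report_layout.txt
        suffix = opt.lstrip("-") or "default"
        out = src.with_name(f"{src.stem}_{suffix}.txt")
        parts = [exe] + ([opt] if opt else []) + [str(src), str(out)]
        # Quote any part that contains a blank
        commands.append(" ".join(f'"{p}"' if " " in p else p for p in parts))
    return commands


# Hypothetical paths, for illustration only
for line in variant_commands(
    r"C:\Xpdf\PDFToText.exe", r"D:\data\permits 2021.pdf", ["", "-layout", "-raw"]
):
    print(line)
```

Each printed line can be pasted into the Command Prompt in turn, and the resulting .txt files compared to see which option gives the most useful output.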

‘Windows Store’ programs will not launch command line programs directly. Instead, this application formats the command as a block of text, which the user can copy to the clipboard. Once the ‘Command Prompt’ console is running, one can paste this text and press Enter to run the command. Previewing the text allows the user to edit the command line to add or modify parameters as desired.

In situations where a path name contains blanks, the program automatically ‘quotes’ the full file path. If, for example, the program is in “C:\Glyph and Cog\XPDF\PDFToText.exe”, quotes are necessary around the full path. Quotes aren’t necessary if the command is C:\GlyphAndCog\XPDF\PDFToText.exe. This rule also applies to the input and output file paths.
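The quoting rule can be expressed in a few lines. This is a minimal sketch of the rule as described above, not the application's actual implementation. The function name quote_if_needed is hypothetical.

```python
def quote_if_needed(path: str) -> str:
    """Wrap a path in double quotes when it contains a blank.

    Paths without blanks, and paths that are already quoted, are
    returned unchanged.
    """
    already_quoted = path.startswith('"') and path.endswith('"')
    if " " in path and not already_quoted:
        return f'"{path}"'
    return path


print(quote_if_needed(r"C:\Glyph and Cog\XPDF\PDFToText.exe"))
print(quote_if_needed(r"C:\GlyphAndCog\XPDF\PDFToText.exe"))
```

The first call prints the path wrapped in quotes. The second prints it unchanged, since it contains no blanks.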

Background

Those working on ‘gigs’ via Upwork.com, Freelancer.com, and Fiverr.com may run across various requests for PDF-to-text conversions. Converting PDF files to text is ‘easy’ if one is dealing with small amounts of content for single instance conversions.

This is less workable when data volumes are large, the formats are complex, the file publication cycle is frequent, or there are ‘defects’ in the published content, such as redactions. Someone working on a ‘data science’ analysis project dealing with gigabytes of data needs more powerful tools. Some reports are published by multiple local entities, so the user has to download and convert dozens or hundreds of files more or less piecemeal. Public documents, such as court dockets or building permits, may be refreshed several times a day. In such circumstances either one person is running conversions frequently, or different people are assigned that responsibility over time. This application is intended to cut out a lot of manual entry and reduce the potential for mistakes.

This program makes it convenient to store various settings and then quickly execute runs when files have changed or new files are available. XPDFLauncher makes Windows user interface elements such as folder and file selection available to make command line composition ‘friendlier’.

Setup

XPDFLauncher will not be of any help until the XPDFReader product is downloaded from www.xpdfreader.com. Since this is a third-party product, all further information on XPDFReader and tools installation and operation is provided from that site. In general, this is an open source product and is therefore free to use. The usual disclaimers apply: use entirely at one’s own risk.

Operation

When XPDFLauncher loads, it displays a single form showing the command executable file path, a drop-down listing various options, an input file name, an output file name, and the complete command line. If the program has been run before, the settings from the previous run are restored to the respective fields. The first time the program is run, all fields are blank.
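The restore-on-load behavior can be sketched as a small settings file that is read at startup and written whenever a field changes. How XPDFLauncher actually stores its settings is not documented here. The JSON format, the field names, and the functions load_settings and save_settings below are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical field names mirroring the form described above
FIELDS = ("exe_path", "option", "input_file", "output_file")


def load_settings(path):
    """Return saved settings, or all-blank fields on the first run."""
    settings = dict.fromkeys(FIELDS, "")
    file = Path(path)
    if file.exists():
        saved = json.loads(file.read_text(encoding="utf-8"))
        # Ignore any keys that are not known form fields
        settings.update({k: v for k, v in saved.items() if k in FIELDS})
    return settings


def save_settings(path, settings):
    """Write the current field values so the next run can restore them."""
    Path(path).write_text(json.dumps(settings, indent=2), encoding="utf-8")


# Simulate a first run followed by a second run, in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    settings_file = Path(folder) / "launcher_settings.json"
    first_run = load_settings(settings_file)  # no file yet: all blank
    save_settings(settings_file, {**first_run, "exe_path": r"C:\Xpdf", "option": "-simple"})
    second_run = load_settings(settings_file)  # values restored
```

Starting from a dictionary of blank fields means a missing or partial settings file still yields a complete form.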

If the executable path is blank, click on the Select .EXE Path button to bring up a folder selection dialog. Navigate to the folder that contains the PDFToText.exe command. This is a folder selection option, not a file selection option, so the name of the file will not appear. Once the folder is selected, the path is displayed in the EXE path text field.

Select an option from the Options drop-down. If no option is selected, none is supplied on the command line. Generally the ‘Simple’ option is the best one to try first. One point of this application is that it is easy to experiment with alternatives.

Click on the Select Input File Name button to select a .PDF file for import. Once the file is selected, the file name is shown in the Input File Name text box.

Click on the Select Output File Name button to select a .txt file for output. If the output file does not exist, type in the name to use and click ‘Save’. If an existing file is selected, it will be overwritten.

Once the program path, input path, and output path are all supplied, the command line is populated with the complete command. If an option is selected, it appears immediately after the .exe command.
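The assembly order described above (executable, then the option if any, then input, then output) might be sketched as follows. This is an illustration rather than the application's source. The function names and sample paths are hypothetical, and it assumes the ‘Simple’ option corresponds to the pdftotext flag -simple.

```python
from pathlib import PureWindowsPath


def quote_if_needed(path):
    """Quote a path only when it contains a blank."""
    return f'"{path}"' if " " in path else path


def build_command(exe_folder, input_pdf, output_txt, option=""):
    """Compose the command line from the form fields.

    Returns an empty string until folder, input, and output are all set.
    """
    if not (exe_folder and input_pdf and output_txt):
        return ""
    # The form stores a folder, so the executable name is appended here
    exe = str(PureWindowsPath(exe_folder) / "PDFToText.exe")
    parts = [quote_if_needed(exe)]
    if option:
        parts.append(option)  # the option goes right after the .exe
    parts.append(quote_if_needed(input_pdf))
    parts.append(quote_if_needed(output_txt))
    return " ".join(parts)


# Hypothetical paths, for illustration only
print(build_command(r"C:\Xpdf Tools", r"D:\reports\docket.pdf",
                    r"D:\reports\docket.txt", "-simple"))
```

Returning an empty string while any required field is missing mirrors the form's behavior of leaving the command line blank until all three paths are supplied.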

This text line can be edited if the user needs to supply further parameters. These parameters are described in the PDFToText.exe documentation.

Click on the ‘Save to Clipboard’ button to copy the command line to the clipboard.

From the ‘Windows System’ folder displayed with the Start menu, select the ‘Command Prompt’ program to open the console. Right-click within the command prompt screen and select Paste. This will paste the text from the command line into the console screen. To run the program, press Enter.

Developer Notes

This program is written and published to soothe a collection of irritants.

Open source programs are often simply console apps and pay little attention to user interface. They may be good at what they do, but they are often embedded in scripts and run in batch processes. This program demonstrates, among other things, how a GUI frontend might be structured for a console application. Users get to experiment with various options without having to copy and paste various file paths or other settings piecemeal. This avoids creating a command line with conflicting, ignored, or redundant parameter settings.

Searching for PDF to text conversion utilities brings up all kinds of programs, most of which are associated with some kind of price. They may or may not produce a satisfactory output. There is a huge amount of variability in the inputs, and the desired outputs are equally driven by user expectations and habits. One can try this program first, before spending any money, to see if it can solve the problem. If someone has already tried an unsatisfactory commercial product, this might solve the problem at no additional cost.

XPDF is a C++ program and runs reasonably fast. Other solutions written in Python (for example) are considerably slower, regardless of their effectiveness. Text extraction from a 4000 page file will typically take two or three seconds on a 64-bit computer with 8 GB of RAM.

While the XPDFLauncher program is a Microsoft Store product and is checked for malicious content, the XPDFReader site is open source and outside of the control of the Microsoft Store. Users should first make sure all their ‘critical updates’ have been applied, so that the Defender Security Center filters are current. Once the .zip file has been downloaded, the user can right-click on the file and run a security scan. These are reasonable precautions for avoiding system compromise.

  • PDF Text Extraction Utility
  • GUI Frontend to CLI executable
Product ID: 9NM816CZGB9C
Release date: 2018-11-06
Last update: 2022-03-12