Textricator: an easy data extractor for PDF files

Textricator is an open source tool worth knowing about. It extracts complex data from PDF documents without the need for programming knowledge. To learn more, visit the project's official website, which links to the tool's code on GitHub and to its documentation.

Textricator extracts text from PDF files and generates structured data (CSV or JSON). This is very practical when you work with many PDFs of the same format or with one large PDF, and it can even work on OCR'd documents. The tool was presented at the 2018 Code for America Summit and was developed by Measures for Justice to help anyone who needs to extract this type of data without programming knowledge.

Where other alternatives require programming, Textricator lets the user describe the structure of the document in a YAML file. That way you can extract data from PDF files in almost any layout, including tables and complex reports generated by tools like Crystal Reports. It's that simple: you specify what you want to collect and Textricator does the rest automatically.
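To give an idea of what "describing the structure" means, here is a minimal Python sketch of the general approach: a small declarative layout drives the extraction, and the output is CSV. It is not Textricator's code or its real configuration schema. The `LAYOUT` dictionary (standing in for a YAML file), the `FRAGMENTS` sample data and all function names are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical layout description: column name -> right edge (in points).
# This is NOT Textricator's real configuration schema, only an illustration.
LAYOUT = {"name": 150, "date": 250, "charge": 400}

# Hypothetical positioned text, as a PDF parser might report it:
# (row number, x position of the fragment, text)
FRAGMENTS = [
    (0, 20, "Jane Doe"), (0, 160, "2018-05-31"), (0, 260, "Trespassing"),
    (1, 20, "John Roe"), (1, 160, "2018-06-02"), (1, 260, "Speeding"),
]


def column_for(x, layout):
    """Return the first column whose right edge lies beyond x."""
    for name, right_edge in layout.items():
        if x < right_edge:
            return name
    return None  # fragment falls outside every declared column


def extract_rows(fragments, layout):
    """Group fragments by row and place each one in its declared column."""
    rows = {}
    for row_no, x, text in fragments:
        col = column_for(x, layout)
        if col is None:
            continue
        record = rows.setdefault(row_no, {name: "" for name in layout})
        record[col] = (record[col] + " " + text).strip()
    return [rows[k] for k in sorted(rows)]


def to_csv(records, layout):
    """Serialize the records as CSV text, one column per layout entry."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(layout), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = extract_rows(FRAGMENTS, LAYOUT)
csv_text = to_csv(records, LAYOUT)
print(csv_text)
```

The point of the sketch is the division of labor: the layout is plain data that a non-programmer can edit, while the extraction logic stays generic. Adapting it to a different document would only mean changing `LAYOUT`.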

Its developers, Joe Hale and Stephen Byrne, have spent the last two years working on the project so that it can extract tens of thousands of pages of data from almost any PDF format. It can be used from the command line, and there is also a GUI available for convenience. So from LxA we encourage you to try it as an alternative to Tabula (whose data extraction functions are more limited than those of the flexible Textricator) and to other similar data extraction software.
