Trouble parsing a PDF corpus


Carolina Rau asked 1 year ago

Hi, I am new to the platform and I am trying to parse a PDF corpus. I uploaded a zip file containing my PDFs and followed the parsing instructions. I received an error message, which tells me to provide you with the following text:

Data Parsing did not finish successfully. See the job log for additional information. To obtain help with this issue, ask us at the forum and include the following text:

job id: 423028

3 Answers


Lionel Staff answered 1 year ago

Dear Carolina,
Yes, thanks and sorry for that. Apparently it is just a matter of accents in the two pdf file names.
Just replace

SISTEMATIZACIÓN-2022.pdf -> Sistematizacion-2022.pdf
Sistematización-2018.pdf -> Sistematizacion-2018.pdf

And it will help!
L
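For anyone with more than a couple of files to fix, the accent removal can be scripted. This is only a minimal sketch, not part of Cortext: the helper name `strip_accents` is made up for illustration, and it assumes that decomposing accented letters and dropping the combining marks is enough for the parser. It keeps the original letter case, so rename further by hand if needed.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Return `name` with accents removed (hypothetical helper, not a Cortext API)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks and keep everything else unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for original in ("SISTEMATIZACIÓN-2022.pdf", "Sistematización-2018.pdf"):
        print(original, "->", strip_accents(original))
```

Run it on the file names before building the zip, then rename the files accordingly.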

Nathanael Jeune replied 1 year ago

Hi, I have also tried parsing a PDF corpus without any accent, all file names are like “Hemsley-Brown_Sharp_2003.pdf”
I get the following error when I look into the debug log: “A filename in your dataset seems to include invalid characters, please remove any accent in the names of the files you are uploading, etc.”

Could you please add the file name requirements to the documentation (or, even better, make the parsing more permissive about names)? Thank you very much!

job id: 424616


Lionel Staff answered 1 year ago

Dear Carolina,
Yes, we should definitely improve the parsing of PDF files and/or update the documentation.
In your new dataset, just replace “-” with “_” in your PDF file names (for example Grima-Farrell_et_al_2011.pdf | Hemsley-Brown_Sharp_2003.pdf | Rycroft‐Smith_2022.pdf …)
I hope it helps!
L
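If the corpus is large, the hyphen replacement can be done in bulk before zipping. The following is a rough sketch under a few assumptions: the function names and the folder layout are hypothetical, and the thread does not say exactly which characters the parser rejects, so treating Unicode dash look-alikes (U+2010 to U+2015, U+2212) the same as the ASCII hyphen is a guess.

```python
import re
from pathlib import Path

# ASCII hyphen plus common Unicode dash look-alikes (assumed to be rejected too).
DASHES = re.compile("[-\u2010-\u2015\u2212]")


def replace_dashes(name: str) -> str:
    """Replace every dash-like character in `name` with an underscore."""
    return DASHES.sub("_", name)


def rename_pdfs(folder: str) -> list:
    """Rename the PDFs in `folder` in place; return (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(folder).glob("*.pdf")):
        new_name = replace_dashes(path.name)
        target = path.with_name(new_name)
        # Skip unchanged names and never overwrite an existing file.
        if new_name == path.name or target.exists():
            continue
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

Calling `rename_pdfs("my_corpus")` on the unzipped folder (a placeholder path) returns the list of renamed files, so you can check the result before re-zipping and uploading.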


Carolina Rau answered 1 year ago

Thank you very much Lionel! It is working now 🙂
