Trouble parsing a PDF corpus

Cortext Manager Q&A forum | Category: Data processing
Carolina Rau asked 2 weeks ago

Hi, I am new to the platform and I am trying to parse a PDF corpus. I uploaded a zip file containing my PDFs and followed the parsing instructions. I received an error message, which tells me to provide you with the following text:

Data Parsing did not finish successfully. See the job log for additional information. To obtain help with this issue, ask us at the forum and include the following text:
job id: 423028
3 Answers
Lionel Staff answered 2 weeks ago

Dear Carolina,
Yes, thanks and sorry for that. Apparently it is just a matter of accents in the two PDF file names.
Just rename them:

  • SISTEMATIZACIÓN-2022.pdf -> Sistematizacion-2022.pdf
  • Sistematización-2018.pdf -> Sistematizacion-2018.pdf

That should help!
L

Nathanael Jeune replied 1 week ago

Hi, I have also tried parsing a PDF corpus, this time without any accents: all file names are like “Hemsley-Brown_Sharp_2003.pdf”.
I get the following error in the debug output: “A filename in your dataset seems to include invalid characters, please remove any accent in the names of the files you are uploading, etc.”

Could you please document the requirements for file names (or, even better, make the parsing more permissive about names)? Thank you very much!

job id: 424616

Lionel Staff answered 1 week ago

Dear Carolina,
Yes, we should definitely improve the parsing of PDF files and/or update the documentation.
In your new dataset, just replace “-” with “_” in your PDF file names (for Grima-Farrell_et_al_2011.pdf | Hemsley-Brown_Sharp_2003.pdf | Rycroft‐Smith_2022.pdf …)
I hope it helps!
L
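
For anyone who wants to apply both fixes from this thread (no accents, "_" instead of "-") to a whole folder before zipping it, here is a minimal Python sketch. It is not part of Cortext Manager: the helper names `sanitize_filename` and `rename_pdfs` are hypothetical, and the rule "keep only ASCII letters, digits, dots and underscores" is an assumption that is stricter than what the staff answers state. Unlike the renames suggested above, it leaves upper/lower case unchanged.

```python
import re
import unicodedata
from pathlib import Path


def sanitize_filename(name: str) -> str:
    """Hypothetical helper: return an ASCII-only variant of a file name."""
    # Split accented letters into base letter + combining mark (e.g. "Ó" -> "O" + accent).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: anything other than ASCII letters, digits, "." and "_" is unsafe.
    # This also catches "-" and look-alike Unicode hyphens.
    return re.sub(r"[^A-Za-z0-9._]", "_", stripped)


def rename_pdfs(folder: str) -> None:
    """Hypothetical helper: rename every *.pdf in `folder` to its sanitized name."""
    for path in Path(folder).glob("*.pdf"):
        target = path.with_name(sanitize_filename(path.name))
        if target == path:
            continue  # name is already clean
        if target.exists():
            print(f"skipped {path.name}: {target.name} already exists")
            continue
        path.rename(target)
        print(f"{path.name} -> {target.name}")
```

Calling `rename_pdfs("my_corpus")` on the unzipped folder (the folder name is only an example) and then re-zipping it should give file names that avoid both problems reported here. Two different originals can map to the same sanitized name, which is why the sketch skips a rename rather than overwrite an existing file.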

Carolina Rau answered 1 week ago

Thank you very much Lionel! It is working now 🙂