PDF filenames changed after parsing

etancoigne asked 6 years ago

Hello Cortext!
I uploaded a corpus of 3,000 PDFs as a zip file. Each PDF has a name based on the pattern “Year_Volume_whatever.pdf”.
I parsed it successfully with these settings: “Split the text content by sentence”: No; “Ignore entries with incorrectly formatted time steps”: Yes.
I got 3 fields: Time Steps, text, and filename. While exploring the corpus, I saw that Cortext cropped my original filename when filling the “filename” field: it is now “Volume_whatever”. The “Time Steps” field is filled with what seems to be the “Volume” part of my filename (no years at all).
I planned to index my corpus against another database that I have, which has a “filename” field corresponding to my original filenames (“Year_Volume_whatever.pdf”).
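To make the mismatch concrete, here is a minimal Python sketch of how the cropped names could be mapped back to the original ones. It assumes Cortext drops only the first underscore-separated token and the “.pdf” extension, which is what I observed but have not confirmed. The function names and the example filenames are hypothetical and are not part of Cortext.

```python
from pathlib import PurePath


def cropped_name(original: str) -> str:
    """Guess the cropped name: drop the extension and the first '_' token.

    Assumption: this mirrors the cropping observed in the "filename" field.
    """
    stem = PurePath(original).stem        # "1998_12_editorial.pdf" -> "1998_12_editorial"
    _, _, rest = stem.partition("_")      # -> "12_editorial"
    return rest or stem                   # no underscore: keep the stem unchanged


def build_lookup(originals):
    """Map each cropped name back to its original filename."""
    lookup = {}
    for name in originals:
        key = cropped_name(name)
        if key in lookup:
            # Two originals that differ only by year would become ambiguous.
            raise ValueError(f"{name!r} and {lookup[key]!r} both crop to {key!r}")
        lookup[key] = name
    return lookup


# Hypothetical filenames following the "Year_Volume_whatever.pdf" pattern.
lookup = build_lookup(["1998_12_editorial.pdf", "2001_15_review.pdf"])
print(lookup["12_editorial"])
```

This only works if no two files share the same “Volume_whatever” part, so the sketch raises an error on such collisions instead of silently overwriting an entry.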
How can I prevent Cortext from changing the filenames I used?
Thanks a lot,
Elise

Question Tags: parsing; PDF; filenames; indexation