Adding metadata to pdf corpus

Daniel Bach asked 6 years ago

Dear CorText team
Thank you for all the great work on your tool!
I have a problem indexing my metadata.
I have uploaded a corpus of PDF files and run terms extraction and network mapping on it.
I have also cleaned some terms from the terms extraction and successfully indexed the new terms list.
My problem is that I now have a Google Sheets file with metadata that I wish to add to my database so that I can project it onto my network map.
I cannot figure out how to do this with the Corpus List Indexer script. I download my CSV with tab separators, but I cannot get the metadata to display on my network visualization.
Link to the spreadsheet file I am attempting to index:
https://docs.google.com/spreadsheets/d/1sIuq11PT7ouw7Kd8If8VWCWv74H-1MpckUct2q__esM/edit?usp=sharing
I hope you can provide a step-by-step guide on how to do this.

All the best

Daniel Bach

Question Tags: adding metadata to PDF corpus, Corpus List indexation

Lionel Staff replied 6 years ago

See below!

2 Answers

Lionel Staff answered 6 years ago

Dear Daniel,

What you are describing looks good. From what I understand, you are close to applying your metadata to your dataset!
You can follow the first part of the video tutorial below to apply metadata to a collection of PDF, TXT (…) files. You only have to use the file names as the variable that links the two sources of information (named filename in your dataset).
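
Because the link relies on the file names matching exactly, it can help to compare the two sides before indexing. The snippet below is only a minimal sketch, not part of CorText Manager: the function name `check_filenames` and the example file names are hypothetical, and whether the filename values in your dataset include the extension is an assumption you should check against your own corpus.

```python
import csv
import io
import os


def check_filenames(metadata_tsv, corpus_files, key="filename"):
    """Compare the key column of a tab-separated table with a list of files.

    Returns two sorted lists: metadata rows with no matching document,
    and documents with no matching metadata row.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    in_metadata = {row[key].strip() for row in reader}
    # Keep only the base name, in case the files sit in a sub-folder of the zip.
    in_corpus = {os.path.basename(name) for name in corpus_files}
    return sorted(in_metadata - in_corpus), sorted(in_corpus - in_metadata)


# Hypothetical example data: two metadata rows, two documents.
metadata = "filename\tyear\nreport_a.pdf\t2015\nreport_b.pdf\t2016\n"
files = ["corpus/report_a.pdf", "corpus/report_c.pdf"]

no_document, no_metadata = check_filenames(metadata, files)
print("Rows without a document:", no_document)
print("Documents without a row:", no_metadata)
```

Any name that shows up in either list will not receive metadata, so it is worth correcting it in the spreadsheet first.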

In addition, so that you do not have to bother with temporal information separately, and since you already have year information in your metadata, please rename that column “ISIpubdate”. After the metadata indexation, CorText Manager will be able to use this information directly in all scripts where temporal information can be used.
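
The renaming can be done by hand in the spreadsheet, or with a few lines of Python on the exported file. This is a minimal sketch under assumptions: the original column is assumed to be called `year` (adapt it to your own header), the helper `rename_column` is hypothetical, and the table is assumed to be tab-separated as described in the question.

```python
import csv
import io


def rename_column(tsv_text, old, new="ISIpubdate"):
    """Return the tab-separated text with header `old` renamed to `new`."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header = rows[0]
    if old not in header:
        raise ValueError(f"column {old!r} not found in {header}")
    rows[0] = [new if name == old else name for name in header]
    out = io.StringIO()
    # Write tab-separated output with plain "\n" line endings.
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical example: one metadata row with a year column.
source = "filename\tyear\nreport_a.pdf\t2015\n"
converted = rename_column(source, "year")
print(converted)
```

Only the header row changes; the data rows are written back untouched.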

This video demonstrates how to build a corpus from TXT files, enrich it with proper time steps and use the distant reading script:

I hope it helps!
Lionel

Daniel Bach replied 6 years ago

Dear Lionel

Thank you for your response!
I have now tried following the steps in the video, but the output that I get from the Corpus List Indexer script has no list object, and I am not able to project metadata onto my co-occurrence map.

I see that in the video there is an option to parse the CSV that is uploaded with the metadata. There seems to have been an interface change since the video was made? When I click “upload file”, no parsing step is prompted, and if I try to use the data parser script I can only parse the original zip file containing my PDFs, not the new CSV file.

I hope you will be able to help me with this problem.

All the best

Daniel Bach

Lionel Staff answered 6 years ago

Dear Daniel
Yes, exactly: we have a new upload button / process! We hope it is simpler this way.

Click on the “upload a file” button
Drag and drop your corpus / dataset / documents / zipped file
Wait until the upload process ends
When it ends, the drag and drop section may turn green: right-click on it, and you will go directly to the parsing step.

In any case, at any time, you can go to the script list (start script > Corpus > Data parsing) and find the parsing script to parse your zipped file independently of the upload step.
I hope it helps!
L

Lionel Staff replied 6 years ago

This question is related to this one: https://docs.cortext.net/question/terms-list-in-type-of-data-missing/
The process is now documented here: https://docs.cortext.net/upload-a-resource/#upload-process
