Trouble parsing JSON files from ISTEX

CorText Manager Q&A forum | Category: Data processing
Déborah Abhervé asked 5 months ago

Hi there! We are trying to analyse a corpus of articles from the French journal “La Houille Blanche”, using the ISTEX web service: https://dl.istex.fr/?q=host.title%3A%22la+houille+blanche%22+&extract=enrichments%5Bmulticat%2Cnb%2Crefbibs%2Cteeft%2Cunitex%5D%3Bfulltext%5Bpdf%2Czip%5D&size=5841&rankBy=qualityOverRelevance&archiveType=zip&compressionLevel=9&sid=istex-dl&usage=1

We downloaded 100 articles as text files with JSON metadata. When we try to parse them, CorText does not recognise the JSON files and reports “log file not found”. Here are the options we used: https://photos.app.goo.gl/oMgBDPFJXC3BoN2G8 Here is the file we used: https://drive.google.com/file/d/1Z62eRiXZNaE-Rg1SA5yOVEUV65lDgeTb/view?usp=sharing

Any ideas? Thanks in advance.

6 Answers
Lionel Staff answered 5 months ago

Dear Deborah,

From what I saw, you cannot parse TXT files with full text and JSON files with metadata at the same time: they are two different data formats.
So there are several options to achieve what you want.

  1. You can separate the TXT and JSON files into two distinct zipped corpora, parse them using the TXT and Json (multiline) options respectively, and analyse them separately.
  2. You can link the TXT files and the metadata using the document ids (but you will need to convert the JSON files to CSV first).
  3. You can parse the two corpora, export the metadata from the JSON dataset using corpus explorer, and bring it back into the TXT dataset with corpus list indexer, using the document ids as the link (see the video here: https://docs.cortext.net/data-parsing/#step-by-step-tutorial-on-how-to-prepare-a-csv-file).
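
For option 2, the JSON-to-CSV conversion could look roughly like the Python sketch below. It is only an illustration: the function name `json_folder_to_csv` is made up, and the field names `id`, `title` and `publicationDate` are assumptions about the ISTEX metadata that should be checked against the downloaded files. Files that are not a JSON object with an `id` field are skipped.

```python
import csv
import json
from pathlib import Path


def json_folder_to_csv(json_dir, csv_path, fields=("id", "title", "publicationDate")):
    """Write one CSV row per JSON file, keeping only the given top-level fields.

    The default field names are assumptions about the ISTEX metadata layout.
    Returns the number of rows written.
    """
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            record = json.load(handle)
        # Skip anything that does not look like a document record.
        if not isinstance(record, dict) or "id" not in record:
            continue
        rows.append({field: record.get(field, "") for field in fields})

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(fields))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The `id` column is what would then serve as the link between the metadata and the TXT documents.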

For the JSON files, you have to use the Json (multiline) option for the parsing, and specify the date field (year). It will work, but it won’t fully support all attributes (e.g. author names and affiliation names are merged into the same field).

In the medium term, we are working on easier support for ISTEX datasets.
I hope it helps!
Lionel

Déborah Abhervé answered 4 months ago

Hello!
Thank you for your answer, but I can’t manage to parse the JSON files or to convert them to CSV… Maybe because the export from ISTEX is already a compressed file? Is there another file format from ISTEX that could be used in CorText to get the metadata?
 
Déborah
 

Lionel Staff answered 4 months ago

Dear Deborah,
That is strange, because I did it last time with the files you provided, following one of the two links, and it worked fine.
From what I remember, just:

  • unzip the files provided by ISTEX
  • separate all the txt files in one folder AND all the json files in another folder
  • zip the new json folder
  • upload it to CorText Manager and use the Json (multiline) option for the parsing
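
If doing the first two steps by hand is tedious, they could be scripted. Below is a minimal Python sketch, assuming the ISTEX export is a single zip archive; the function name `split_istex_export` and the `extracted`, `txt` and `json` folder names are only illustrative.

```python
import shutil
import zipfile
from pathlib import Path


def split_istex_export(export_zip, work_dir):
    """Unzip an export and copy its .txt and .json files into two flat folders.

    Returns the number of files copied per extension. Files that share a name
    in different subfolders would overwrite each other in the flat folders.
    """
    work = Path(work_dir)
    extracted = work / "extracted"
    with zipfile.ZipFile(export_zip) as archive:
        archive.extractall(extracted)

    counts = {}
    for ext in ("txt", "json"):
        target = work / ext
        target.mkdir(parents=True, exist_ok=True)
        # Search every subfolder of the extracted archive.
        files = [p for p in extracted.rglob(f"*.{ext}") if p.is_file()]
        for path in files:
            shutil.copy2(path, target / path.name)
        counts[ext] = len(files)
    return counts
```

For example, `split_istex_export("export.zip", "work")` would leave the text files in `work/txt` and the metadata files in `work/json`.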

Tell me if that works for you.
L
 
 

Déborah Abhervé answered 4 months ago

Hello Lionel,
I had some difficulty separating the txt and json files, but I managed to do it: on Mac, create a new “smart folder” and select json files only (instead of copying and pasting each file…). I did that, zipped the new folder, uploaded it to CorText and parsed it with Json (multiline), but it doesn’t work… I get the same message each time:

Debug Log:
Error! Log file not found.

I have sent you an invite so that you can try it and check for yourself what the problem is, if you have time.

Thank you very much!
Déborah
Lionel Staff replied 4 months ago

Yes, please, you can add me to your project using: lionel dot villard at esiee dot fr

Lionel Staff answered 4 months ago

See my answer directly in your project!

  • Remove the manifest.json file, which does not contain any data and does not have the same structure as the other files;
  • Zip the json files directly (zip > list of .json). You had a folder inside the zip file (zip > folder > .json list).
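
Both points can be handled in one go with a small script. This is just a sketch using Python's standard `zipfile` module; the function name `zip_json_flat` is hypothetical. It writes each .json file at the root of the archive (no enclosing folder) and leaves out manifest.json.

```python
import zipfile
from pathlib import Path


def zip_json_flat(json_dir, zip_path, skip=("manifest.json",)):
    """Zip the .json files of json_dir at the archive root, without a folder.

    Files whose name is listed in `skip` are left out.
    Returns the names written to the archive.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(json_dir).glob("*.json")):
            if path.name in skip:
                continue
            # arcname=path.name drops the folder part, so the archive is flat.
            archive.write(path, arcname=path.name)
            names.append(path.name)
    return names
```

Opening the resulting zip should then show the .json files directly, not a folder containing them.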

I hope it helps.
L