troubles parsing json files from ISTEX

CorText Manager Q&A forum, Category: Data processing
Déborah Abhervé asked 2 weeks ago

Hi there! We are trying to analyse a corpus of articles from the French journal “La Houille Blanche”, using the ISTEX web service: https://dl.istex.fr/?q=host.title%3A%22la+houille+blanche%22+&extract=enrichments%5Bmulticat%2Cnb%2Crefbibs%2Cteeft%2Cunitex%5D%3Bfulltext%5Bpdf%2Czip%5D&size=5841&rankBy=qualityOverRelevance&archiveType=zip&compressionLevel=9&sid=istex-dl&usage=1

We downloaded 100 articles as text files with JSON metadata. Now CorText won’t recognise the JSON files (“log file not found”) when we attempt to parse them. Here are the options we used: https://photos.app.goo.gl/oMgBDPFJXC3BoN2G8 Here’s the file we used: https://drive.google.com/file/d/1Z62eRiXZNaE-Rg1SA5yOVEUV65lDgeTb/view?usp=sharing

Any ideas? Thanks in advance.

3 Answers
Lionel Staff answered 2 weeks ago

Dear Deborah,

From what I saw, you cannot parse TXT files (full text) and JSON files (metadata) at the same time, because they are two different data formats.
There are several ways to achieve what you want.

  1. You can separate the TXT and JSON files into two distinct zipped corpora, parse them with the TXT and JSON (multiline) options respectively, and analyse them separately.
  2. You can link the TXT files to the metadata using the document ids (but you will need to convert the JSON files to CSV first).
  3. You can parse the two corpora, export the metadata from the JSON dataset using corpus explorer, and bring it back into the TXT dataset with corpus list indexer, using the document ids as the link (see the video here: https://docs.cortext.net/data-parsing/#step-by-step-tutorial-on-how-to-prepare-a-csv-file).
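For option 2, here is a rough sketch of how the JSON metadata could be flattened into one CSV, with one row per document. It is only an illustration: the function name `json_folder_to_csv` is made up, and the default field names (`id`, `title`, `publicationDate`) are assumptions about what the ISTEX JSON files contain, so check them against one of your own files first.

```python
import csv
import json
from pathlib import Path


def json_folder_to_csv(json_dir, csv_path,
                       fields=("id", "title", "publicationDate")):
    """Write one CSV row per JSON metadata file found under json_dir.

    The default field names are assumptions; adapt them to your files.
    Returns the number of rows written.
    """
    rows = []
    for path in sorted(Path(json_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        if not isinstance(record, dict):
            continue  # skip files that are not a single JSON object
        # The file name (without extension) is kept as a possible join key.
        row = {"filename": path.stem}
        for field in fields:
            value = record.get(field, "")
            # Lists and nested objects are stored as JSON strings.
            if isinstance(value, (list, dict)):
                value = json.dumps(value, ensure_ascii=False)
            row[field] = value
        rows.append(row)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["filename", *fields])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `json_folder_to_csv("json_files", "metadata.csv")` (both paths are placeholders) would give a table whose `filename` or `id` column can then serve as the link to the TXT documents.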

For the JSON files, you have to use the JSON (multiline) option for the parsing and specify the field that contains the date (year). It will work, but it won’t fully support all attributes (e.g. author names and affiliation names are merged into the same field).

In the medium term, we are working on easier support for ISTEX datasets.
I hope it helps!
Lionel

Déborah Abhervé answered 4 days ago

Hello!
Thank you for your answer, but I can’t manage to parse the JSON files or to convert them into CSV… Maybe because the export from ISTEX is already a compressed file? Is there another ISTEX file format that CorText can use to get the metadata?
 
Déborah
 

Lionel Staff answered 2 days ago

Dear Deborah,
That is strange, because I did it last time with the files you provided, following one of the two links, and it worked fine.
From what I remember, just:

  • unzip the archive provided by ISTEX
  • put all the TXT files in one folder and all the JSON files in another folder
  • zip the new JSON folder
  • upload it to CorText Manager and use the JSON (multiline) option for the parsing
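If doing this by hand is tedious, the first three steps can be sketched in a few lines of Python. This is only a sketch: the function name `extract_json_to_zip` is hypothetical, and it assumes that every JSON file in the ISTEX archive has a distinct file name, since sub-folders are dropped when the files are copied.

```python
import zipfile
from pathlib import PurePosixPath


def extract_json_to_zip(istex_zip, json_zip):
    """Copy every .json member of istex_zip into a new, flat archive.

    Assumes the JSON file names are unique across the archive.
    Returns the number of files copied.
    """
    count = 0
    with zipfile.ZipFile(istex_zip) as source, \
            zipfile.ZipFile(json_zip, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".json"):
                continue  # ignore folders, TXT files and anything else
            # Keep only the file name so the new archive has no sub-folders.
            flat_name = PurePosixPath(info.filename).name
            target.writestr(flat_name, source.read(info))
            count += 1
    return count
```

For example, `extract_json_to_zip("istex_export.zip", "json_only.zip")` (placeholder file names) would produce the archive to upload to CorText Manager.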

Let me know if that works for you.
L