Compare 2 corpuses

Déborah Abhervé asked 4 years ago

I have two corpuses of references (from Scopus). The first has 1150 references, the second 200 references. I would like to know to what extent the authors of the second corpus are present in the first corpus and how they are distributed.
I have not been able to find a way to automatically indicate that such and such authors from the first corpus were also in the second corpus.
Is there a way to compare two corpuses in CorText?
Thank you for your help.

Lionel Staff answered 3 years ago

Dear Déborah,
CorText Manager was originally designed to work with data in “silos”, meaning that an analysis cannot be run on two datasets at the same time.
However, you can still produce resources from one or more datasets and use them in one or more other datasets.
For your question, the most straightforward way would be to:

  • run a list builder on the author variable in both datasets;
  • download the two lists, merge them, and tag each author according to whether they come from the first dataset, the second one, or both;
  • re-upload the merged list, and run a list indexer to use it in your datasets.
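
The merge-and-tag step in the second bullet happens outside CorText, for example in a spreadsheet or with a short script. Below is a minimal Python sketch of that step only. The function name `tag_authors` and the tag values "first", "second" and "both" are hypothetical, and the sketch assumes author names are spelled identically in both lists (name variants would need cleaning beforehand).

```python
def tag_authors(first_authors, second_authors):
    """Map each author name to 'first', 'second' or 'both'."""
    # Strip surrounding whitespace and drop empty entries
    first = {name.strip() for name in first_authors if name.strip()}
    second = {name.strip() for name in second_authors if name.strip()}

    tags = {}
    for name in sorted(first | second):
        if name in first and name in second:
            tags[name] = "both"
        elif name in first:
            tags[name] = "first"
        else:
            tags[name] = "second"
    return tags


if __name__ == "__main__":
    # Made-up names, for illustration only
    example = tag_authors(["Martin A.", "Dupont B."], ["Dupont B.", "Nguyen C."])
    for name, tag in example.items():
        print(name, tag)
```

The resulting mapping can then be saved as a two-column, tab-separated file before re-uploading it.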

Depending on the questions you want to address, you could also:

  • build a list of the ids of your scientific articles, and tag each id according to whether it comes from the first dataset, the second one, or both;
  • merge the two datasets (be careful if some articles appear in both datasets);
  • upload your list of ids with the tags, and run a list indexer to apply the tags to the merged dataset.
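
The id-based variant can also be prepared outside CorText. The following Python sketch merges two lists of records, keeps each article once even when it appears in both datasets, and tags each id. It is only an illustration: the function name `merge_and_tag`, the `"id"` field and the tag values are assumptions, not CorText features.

```python
def merge_and_tag(first_records, second_records, id_key="id"):
    """Merge two lists of article records and tag each id by origin.

    Returns (merged_records, tags): each article is kept once, and tags
    maps each id to 'first', 'second' or 'both'.
    """
    merged = {}
    tags = {}
    for label, records in (("first", first_records), ("second", second_records)):
        for record in records:
            key = record[id_key]
            if key not in merged:
                # First time this article is seen: keep it and note its origin
                merged[key] = record
                tags[key] = label
            elif tags[key] != label:
                # Already seen in the other dataset: redundant article
                tags[key] = "both"
    return list(merged.values()), tags


if __name__ == "__main__":
    # Made-up ids, for illustration only
    large = [{"id": "a1"}, {"id": "a2"}]
    small = [{"id": "a2"}, {"id": "a3"}]
    records, id_tags = merge_and_tag(large, small)
    print(len(records), id_tags)
```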

I hope it helps!

Déborah Abhervé answered 3 years ago

Thank you Lionel for your answer.
. run a list builder of the author variable in both datasets => OK
. download the two lists, merge them, and tag the authors according to whether they come from the first dataset, the second one, or both => to tag the authors, I added a column (Corpus_NB) containing “1” or “2” depending on whether the author belongs to the first or the second dataset. Is this the right way to do it?
. Reupload the merged list, and run a list indexer to use it in your datasets => I uploaded a .tsv file but I could not find my new column in Corpus List Indexer.

Thank you again for your help!

Lionel Staff answered 3 years ago

Yes, that looks good.
In my view, you should have three values:

  • one if the author is only in the 1st dataset
  • one if the author is only in the 2nd dataset
  • one if the author is in both datasets
  • for these values, perhaps try to use strings: it will make your analysis easier (for example: “largecorpus”, “smallcorpus” and “both”)

The TSV file should have:

  • two columns: the first with all the author names from the two datasets (named for example “authors”), the second with the corpus (named for example “corpus”) and containing the three values;
  • tab-separated values and UTF-8 encoding
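
A file with this layout can be written with Python's standard `csv` module, as in the minimal sketch below. The function name `write_author_list` and the output file name are hypothetical. The column names “authors” and “corpus” are the examples given above, and the sketch assumes a header row is wanted.

```python
import csv


def write_author_list(path, tags):
    """Write a two-column, tab-separated, UTF-8 file of authors and tags."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Header row: example column names, assumed here
        writer.writerow(["authors", "corpus"])
        for name, corpus in sorted(tags.items()):
            writer.writerow([name, corpus])


if __name__ == "__main__":
    # Made-up names, for illustration only
    write_author_list(
        "authors_tagged.tsv",
        {"Martin A.": "largecorpus", "Dupont B.": "both", "Nguyen C.": "smallcorpus"},
    )
```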

And then:

I hope it helps

Déborah Abhervé answered 3 years ago

Thank you very much,
I did all that, but the new list produced by Corpus List Indexer doesn’t have a new column “corpus”; instead it has 3 new lines (“1”, “2”, “3”), with all the authors tagged “1”, “2” or “3”…
(here is the link if it works for you :
I think I did something wrong… (and I would appreciate attending the next CorText training session ;-)!)
Thank you

Lionel Staff replied 3 years ago

Could you invite me to your project? I cannot access your results 🙂

lionel dot villard at esiee dot fr