Text processing in Spanish

Yuri Saldarriaga asked 9 months ago

Dear Lionel:
I am using CorText to process documents in Spanish for my master's thesis, and I have some questions. First, I would like to know if you recommend using CorText for analyzing a corpus in Spanish. I read in the documentation that the Spanish version has been less extensively tested.
Second, I would like to exclude some words from the analysis, and I was wondering if CorText has a dictionary of stop words ("empty words") in Spanish. I read the documentation and tried to do this through the “Query type” function. However, I could not exclude the words I needed to. My corpus consists of documents in txt format, and I am trying to exclude some Spanish words that have no relevance for my analysis or are empty (such as connectors, prepositions or articles).
Thank you very much!
Yuri

2 Answers
Lionel Staff answered 9 months ago

Dear Yuri,
Yes, we recommend using CorText Manager in every situation 🙂
When we say that Spanish has been less intensively tested than French and English, it does not mean that no one has used it yet. Some researchers have used it for their own work, and some studies in Spanish have already been published. Feedback is very welcome if you find strange noun phrases extracted by the lexical extraction script. Do not hesitate!
I would recommend that you:

  • perform a lexical extraction (if I understood correctly, you have already done this)
  • refine the extracted noun phrases (you can exclude some words, build a block list, etc.). You may want to watch the video below, starting at 5:00
  • apply your cleaned list of noun phrases to your corpus with the terms indexer script
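
If you prefer to pre-clean the exported term list outside the interface, the refinement step above can be sketched in a few lines of Python. This is only an illustration, not part of CorText Manager: the `entity` column name, the sample rows and the very short stop list are assumptions, so adapt them to your own export.

```python
# Hypothetical, deliberately tiny Spanish stop list (connectors,
# prepositions, articles). Extend it for real use.
SPANISH_STOP_WORDS = {
    "de", "la", "el", "en", "y", "que", "los", "las",
    "por", "para", "con", "sin embargo",
}


def filter_terms(rows, stop_words=SPANISH_STOP_WORDS):
    """Return the rows whose 'entity' value is not in the stop list.

    The comparison ignores case and surrounding whitespace.
    'entity' is an assumed column name for the extracted term.
    """
    kept = []
    for row in rows:
        term = row["entity"].strip().lower()
        if term not in stop_words:
            kept.append(row)
    return kept


# Made-up sample rows, standing in for an exported term list.
sample_rows = [
    {"entity": "cambio climático", "frequency": "12"},
    {"entity": "sin embargo", "frequency": "30"},
    {"entity": "De", "frequency": "50"},
]

cleaned = filter_terms(sample_rows)
print([row["entity"] for row in cleaned])  # prints ['cambio climático']
```

The cleaned rows could then be written back to a file and used as the list for the indexing step.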

Do not hesitate to come back if it is unclear or if you have more questions.

I hope it helps!
L

Yuri Saldarriaga answered 8 months ago

Thank you very much for your answer, Lionel. Yes, I have done the lexical extraction. Then I followed your instructions, and I was able to exclude the words I needed to.

Now, I have another question. I have tried many times to index a new category in my corpus, but I could not do it. I followed these steps:

  1. I used the list builder to download a list (using the filename option). Then I opened it in Excel and found three columns: “entity”, “frequency” and “number of distinct documents”. I added a new column with the year, next to “entity”, and saved the document as CSV. So far I have also tried deleting the “frequency” and “number of distinct documents” columns, saving the list in other CSV formats, working on both Mac and Windows, etc.
  2. I uploaded the CSV file.
  3. I started a new script using the “corpus list indexer” function. In “field” I put “filename”, I answered no to “Define a custom list of entities” and yes to “Add a dictionary of equivalent strings”, where I selected the list I had uploaded (in some attempts, I answered yes to both options and selected the CSV I had uploaded). Here, I also changed the name of the new process.
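
For reference, the list preparation in step 1 can also be reproduced outside Excel, which rules out spreadsheet-specific encoding or delimiter changes. The following is only a sketch of what I was trying to produce: it assumes that the year can be read from a four-digit number in each file name (hypothetical), it keeps only the “entity” column plus a new “year” column, and it does not show that the corpus list indexer accepts this layout.

```python
import csv
import io
import re


def add_year_column(csv_text, delimiter=","):
    """Return CSV text with columns 'entity' and 'year'.

    Assumption: 'entity' holds a file name containing a four-digit year
    (19xx or 20xx). Rows without such a year get an empty 'year' cell.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["entity", "year"],
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        match = re.search(r"(19|20)\d{2}", row["entity"])
        year = match.group(0) if match else ""
        writer.writerow({"entity": row["entity"], "year": year})
    return out.getvalue()


# Made-up sample standing in for the list downloaded from the list builder.
sample = (
    "entity,frequency,number of distinct documents\n"
    "informe_2015.txt,1,1\n"
    "notas.txt,1,1\n"
)
print(add_year_column(sample))
```

To write the result to disk, open the output file with `encoding="utf-8"` and `newline=""` so that accented characters and line endings are preserved.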

This is where the first problem appears. When I open the “corpus list indexer” output file, it has three columns with the same information, and the year I included in the CSV list is missing. That is, I have the same data in “entity”, “entity label” and “forms”. As a result, when I try to run the next analysis, an error appears.

I do not know what else I can do. I would appreciate your help!

Thank you very much,

Yuri