Europresse corpus parsing

CorText Manager Q&A forum. Category: Data processing
Hannah asked 2 months ago

I can’t run the parsing script on a Europresse corpus (exported as HTML, then zipped).
The error displayed is “Log file not found”.
Thanks for your help!

2 Answers
Lionel Staff answered 2 months ago

Dear Hannah,
Apparently, Europresse has added a new class to (some of) the article titles.
If you know how to open HTML files in an editor, just:

  • replace every occurrence of “sm-margin-TopNews ” (without the quotes; the trailing space is important)
  • with an empty string: “”

Save it as a new file. Zip it and parse it.
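
If you would rather not edit the files by hand, the same steps can be scripted. The following is only a minimal sketch in Python, not the CorText parser itself: the function name clean_zip and the archive names in the usage note are made up for illustration, and it assumes the HTML files use an ASCII-compatible encoding (UTF-8, Latin-1, ...), since it works directly on bytes.

```python
import zipfile

# Class name reported in this thread; the trailing space is intentional.
TARGET = b"sm-margin-TopNews "


def clean_zip(src_path, dst_path, target=TARGET):
    """Copy a zip archive, removing `target` from every .html/.htm file.

    Returns the number of occurrences removed. The source archive is
    left untouched; a new archive is written to dst_path.
    """
    removed = 0
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.lower().endswith((".html", ".htm")):
                removed += data.count(target)
                # Same operation as search-and-replace with an empty string.
                data = data.replace(target, b"")
            dst.writestr(info.filename, data)
    return removed
```

For example, clean_zip("europresse.zip", "europresse_clean.zip") (hypothetical file names) would write a cleaned copy that can then be uploaded and parsed. If it returns 0, the class was not present in that export.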
We will work on a fix soon. Many thanks for the report!

Hannah replied 2 months ago

Thank you for your answer! I’m not sure how to make changes to the HTML file, though (would RStudio work?). I’m trying to figure this out by myself; I’ll let you know how it goes 🙂

Lionel Staff answered 2 months ago

Dear Hannah,
Even simpler: a basic text editor will do, such as Notepad++ or whatever you are used to.
I hope this helps!

Hannah replied 2 months ago

Got it! It worked. Thanks a lot, and have a nice day!