Problems with parsing a .txt corpus

matias.milia asked 6 years ago

Hi, I am trying to upload a corpus of text extracted from Mexican digital media, but it won’t parse. All the files are encoded in Unicode UTF-8; I don’t know whether that can be a problem for Cortext. I have done this in the past and the process was quite straightforward, so I don’t know if I am messing something up. Anyway, here is the error I get from the script log:

2020-05-27 02:43:22 DEBUG : Something went wrong while trying to parse, are you sure you selected the correct corpus format ?

Any ideas? There are 126 documents, each in a separate ‘.txt’ file, all zipped together in a single ‘.zip’ file.
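
One way to rule out encoding as the cause is to check that every ‘.txt’ member of the archive really decodes as UTF-8 before uploading. The following is a minimal sketch using only the Python standard library; the function name `find_non_utf8_members` is hypothetical, and the check says nothing about how Cortext itself parses a corpus.

```python
import io
import zipfile


def find_non_utf8_members(archive):
    """Return the names of .txt members that do not decode as UTF-8.

    `archive` is a path or a binary file-like object holding the .zip file.
    (Hypothetical helper for illustration; not part of Cortext.)
    """
    bad = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directories and anything that is not a .txt file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            try:
                zf.read(info).decode("utf-8")
            except UnicodeDecodeError:
                bad.append(info.filename)
    return bad
```

Calling `find_non_utf8_members("corpus.zip")` (the path is a placeholder) returns an empty list when every text file is valid UTF-8; any names it returns are files worth re-encoding before trying the upload again.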

PS: Sorry, I tried to paste the complete error log, but the server blocked me every time, saying ‘A potentially unsafe operation has been detected in your request to this site. Your access to this service has been limited. (HTTP response code 403)’

Lionel Staff replied 6 years ago

Dear Matias,
Could you invite me to your project so that I can check your txt dataset? I will leave the project afterwards.

My address: lionel dot villard at esiee dot com
Best regards
L

matias.milia replied 6 years ago

Thanks Lionel, I have already sent you an invitation to join the project. FYI, the text files still need some cleaning; this was just my first attempt.
Let me know if there is anything I can do.

Thanks again!

Lionel Staff replied 6 years ago

I made a mistake in my own email address!
The correct one is: lionel dot villard at esiee dot fr
Could you invite me again?

matias.milia replied 6 years ago

No worries, it happens. I’ve already invited you again.
Thanks for your help!

1 Answer

Lionel Staff answered 6 years ago

Answered inside the project!
I hope it helps!
Lionel
