indexing emojis

Christophe Prieur asked 2 years ago

Hi guys,
I’d like to index emojis.
I first tried terms extraction but no option appears to suit extracting just characters.
So I tried the corpus terms indexer: I created a list of emojis in one column, then copied it twice to get a three-column TSV file, like this:

stem (tab) main form (tab) forms
🤡 (tab) 🤡 (tab) 🤡
😂 (tab) 😂 (tab) 😂
🤔 (tab) 🤔 (tab) 🤔

etc.
[bloody hell, copying this table here was a pain, uploading a screenshot would have helped]
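
For reference, a term list like the one above can be generated with a short script rather than by hand. This is only a minimal Python sketch, not something provided by CorTexT: the function names and the file name `emoji_terms.tsv` are made up for illustration, and whether the indexer expects the header row is an assumption based on the layout shown above. The file is written explicitly as UTF-8 so that the emoji survive on disk.

```python
import csv
import io


def build_emoji_term_list(emojis):
    """Return TSV text where stem, main form and forms are all the emoji itself."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header row mirrors the three columns described in the post (assumed layout).
    writer.writerow(["stem", "main form", "forms"])
    for emoji in emojis:
        writer.writerow([emoji, emoji, emoji])
    return buffer.getvalue()


def write_emoji_term_list(path, emojis):
    """Write the term list to disk, forcing UTF-8 encoding."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(build_emoji_term_list(emojis))


# Example call (hypothetical file name):
# write_emoji_term_list("emoji_terms.tsv", ["🤡", "😂", "🤔"])
```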

I ran the corpus terms indexer with the following options:

  • word boundaries: non-separated
  • normalize lexical items: no

But after running for a few seconds, I got “0 occurrences indexed”.
Is that idea at all doable?

Thanks,
Christophe.


3 Answers
Lionel Staff answered 2 years ago

Dear Christophe,
It is true that this is not supposed to work with the corpus terms extractor / indexer. To double-check, please invite me to your project: lionel dot villard at esiee dot fr
For now, the only option I see is to use an emoji dictionary to replace the UTF-8 emoji codes in your initial corpus with their full textual descriptions.
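
One possible way to do that replacement outside CorTexT is sketched below in Python. It is an illustration under assumptions, not a CorTexT feature: instead of a hand-made dictionary it uses the official Unicode character names from the standard `unicodedata` module, it treats "Symbol, other" characters at or above U+1F000 as emoji (which misses older symbols such as ❤), and the `:name:` output format and function names are arbitrary choices.

```python
import unicodedata


def is_emoji_like(char):
    """Rough heuristic (assumption): 'Symbol, other' code points from U+1F000 up."""
    return ord(char) >= 0x1F000 and unicodedata.category(char) == "So"


def describe_emojis(text):
    """Replace each emoji-like character with its Unicode name, e.g. :clown_face:."""
    pieces = []
    for char in text:
        name = unicodedata.name(char, "") if is_emoji_like(char) else ""
        if name:
            pieces.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            # Ordinary characters (and unnamed ones) are kept unchanged.
            pieces.append(char)
    return "".join(pieces)
```

The transformed text contains only plain words, so it could then go through the usual terms extraction. Flags and other multi-character sequences would come out as one name per code point, which may or may not be what you want.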

We are also investigating the feasibility of extracting emoji, emoji codes and descriptions using our Named Entity Recognizer script.
I hope this helps.
Lionel

Christophe Prieur replied 2 years ago

Hi Lionel,
Done. https://managerv2.cortext.net/project/144590005283
See indexer script at 2022-04-20 15:04:50

What I don’t understand is why the option “word boundaries: non-separated” doesn’t fit.

Lionel Staff replied 2 years ago

From what I see, the original CSV file with tweets (zipped in all-dates-retravaillees-tire.zip) was not stored in UTF-8 (or was accidentally converted before being zipped). It uses an ANSI charset encoding: the emoji have mostly been converted to “??”. So, when working with it in CorTexT Manager, there is no way to search for UTF-8 emoji codes.
L
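
A quick check of this kind can be run on a file before uploading it. The Python sketch below is only an illustration: the function name and the returned keys are made up, "ANSI" is assumed to mean Windows-1252, and "emoji" is approximated as any code point from U+1F000 up. Note that a file whose emoji were already turned into “??” is plain ASCII and therefore still valid UTF-8, so the sketch also counts emoji-range characters and “??” runs rather than relying on decoding alone.

```python
def inspect_encoding(raw_bytes):
    """Summarise whether the bytes look like UTF-8 text that still contains emoji."""
    try:
        text = raw_bytes.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        # Fall back to Windows-1252 (assumed meaning of "ANSI") just to inspect the text.
        text = raw_bytes.decode("cp1252", errors="replace")
        valid_utf8 = False
    return {
        "valid_utf8": valid_utf8,
        # Characters in the emoji-heavy planes (rough approximation).
        "emoji_like_chars": sum(1 for char in text if ord(char) >= 0x1F000),
        # Non-overlapping "??" runs, the typical trace of lost emoji.
        "double_question_marks": text.count("??"),
    }


# Example call (file name is a placeholder):
# with open("tweets.csv", "rb") as handle:
#     print(inspect_encoding(handle.read()))
```

Zero emoji-like characters together with many “??” runs suggests the encoding was lost somewhere upstream.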

Christophe Prieur answered 2 years ago

In the end it works. You were right, Lionel: I was using the wrong version of the file, whose encoding had been broken (by Excel of course, what else?).
Here are the winners, by number of distinct documents (mainly tweets during the French election campaign):
😂 1458
🤣 1057
🇫🇷 909
😉 578
👍 412
🤔 362
Christophe.
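
For readers who want to reproduce this kind of count outside CorTexT, here is a minimal Python sketch of a distinct-document count. It is not how CorTexT computes its figures: the function names are made up, emoji are approximated as code points between U+1F000 and U+1FAFF, and two consecutive regional indicator characters are paired into one flag such as 🇫🇷. Skin-tone modifiers would be counted as separate symbols under this approximation.

```python
from collections import Counter


def is_regional_indicator(char):
    """Regional indicator letters; two of them in a row form a flag."""
    return 0x1F1E6 <= ord(char) <= 0x1F1FF


def extract_emojis(text):
    """Return the set of distinct emoji found in one document."""
    found = set()
    i = 0
    while i < len(text):
        char = text[i]
        if is_regional_indicator(char):
            if i + 1 < len(text) and is_regional_indicator(text[i + 1]):
                found.add(char + text[i + 1])  # keep the flag as one unit
                i += 2
                continue
            # A lone regional indicator is ignored.
        elif 0x1F000 <= ord(char) <= 0x1FAFF:
            found.add(char)
        i += 1
    return found


def document_frequency(documents):
    """Count, for each emoji, the number of documents that contain it at least once."""
    counts = Counter()
    for document in documents:
        counts.update(extract_emojis(document))
    return counts
```

Calling `document_frequency(tweets).most_common(6)` on a list of tweet texts would give a ranking in the same spirit as the one above.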

Lionel Staff answered 2 years ago

Dear Christophe,
A new feature has been published: https://docs.cortext.net/named-entity-recognizer/#emojis
The NER script of CorTexT Manager is now able to annotate, extract and use emojis in analyses (in a network, to tag…), among other interesting new features!
L