This documentation provides detailed information about CorText Manager.
Once you have produced and uploaded your corpus, the first required step is to parse it (the parsing script should be launched automatically after upload). This task converts the original corpus into a convenient format (namely an SQLite database) that is tractable for further processing.
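The exact table layout of the parsed database is not documented here, but since the parsed file is a standard SQLite database, you can inspect a downloaded .db file locally with Python's built-in sqlite3 module. A minimal sketch (the path and table names are placeholders, not actual CorText schema):

```python
import sqlite3

def list_tables(db_path):
    """Return the names of all tables in a parsed corpus (.db file)."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        con.close()

def peek(db_path, table, limit=5):
    """Print the first few rows of a table to get a feel for its schema."""
    con = sqlite3.connect(db_path)
    try:
        # The table name comes from list_tables(), i.e. a trusted source.
        for row in con.execute(f"SELECT * FROM {table} LIMIT ?", (limit,)):
            print(row)
    finally:
        con.close()
```

For example, `list_tables("mycorpus.db")` lists the available tables, and `peek("mycorpus.db", some_table)` shows a sample of their contents.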
Different tools (scripts) are then at your disposal to analyze this dataset. Before launching them, always make sure to choose the parsed file (the .db file) as your corpus. It is also advisable to read the job log by clicking on the flag (green or red) that appears on the project page once a script has started. When a job has failed, the last line of the log should provide a succinct explanation. If it does not, please don’t hesitate to report the bug on the forum.
CorText offers a full ecosystem of modeling and exploratory tools for analyzing text corpora. The illustration below shows a global scheme of possible treatments; users are free to define their own analysis workflow.
- Demography generates basic descriptive statistics about the structure and evolution of the main fields in your dataset,
- Lexical Extraction automatically extracts lists of pertinent terms using NLP techniques,
- Named Entity Recognizer detects named entities such as persons, organizations, and locations,
- You can also index databases with your own custom term lists; a dedicated interface makes it easy to create such lists,
- Heterogeneous Networks Mapping performs homogeneous and heterogeneous network analysis and produces intelligible, tunable representations of dynamics,
- Topic Modeling offers a powerful solution for analyzing the semantic structure of collections of texts,
- Contingency Matrix provides a direct visualization of correlations between distinct fields in your data,
- Period Detector longitudinally analyzes the composition of your data to automatically detect structurally distinct periods,
- You can customize the periods you wish to work on with Period Slicer. Quantitative data can also be easily pre-processed with the Data Slicer script,
- Query A Corpus creates any sub-corpus resulting from a complex query,
- Different scripts allow you to filter out and clean categorical lists: List Builder and Corpus List Indexer,
- Distant Reading builds an interface for comparing the dynamic profiles of words in a dynamic corpus,
- The Correspondence Analysis script provides minimal facilities to perform a multiple correspondence analysis on any set of variables,
- Contrast Analysis is an exploratory tool for visualizing terms that are over- or under-represented in a given sub-corpus,
- Word2Vec Explorer maps large numbers of words whose positions have been trained with a word2vec model.
If you don’t have any dataset available, please feel free to use this dataset of recipes from a former Kaggle competition, featuring almost 40,000 cooking recipes along with their regional cuisine of origin. Simply upload the zip file, parse the dataset as a JSON file, and start exploring (see the two maps produced with this dataset below)! A corpus compiling every State of the Union address since 1790 is also available for download. The speeches are divided by paragraph, and the speaker and year of each address are included.
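Before uploading, you may want a quick local look at the recipes file. Assuming it follows the original Kaggle format (a JSON array of objects each carrying a `cuisine` field; verify the field names against your actual download), a sketch of counting recipes per cuisine:

```python
import json
from collections import Counter

def cuisine_counts(path):
    """Count recipes per regional cuisine in a Kaggle-style recipes file.

    Assumes the file is a JSON array of objects with a 'cuisine' key;
    adjust the field name if your download differs.
    """
    with open(path, encoding="utf-8") as f:
        recipes = json.load(f)
    return Counter(r["cuisine"] for r in recipes)
```

Calling `cuisine_counts("recipes.json").most_common(5)` would then show the five best-represented cuisines in the corpus.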
In addition to this website, the slides used for training sessions can be downloaded via this link. Frédérique Mélanie and Pablo Ruiz have also written a rich visual guide for people wishing to start using CorText, which they have kindly agreed to share with the community.