Period Detector

Two periods detected

Composition of the dataset at different time steps

Period detector directly works on the frequency distribution of (a) given field(s) to produce a matrix that maps the distance between the composition of the dataset at two different time steps (in the example above indexed by years for instance).

Field

One should define the fields to consider for constructing frequency vector profiles at each time step. The “dissimilarity” between two time-steps is then computed as 1 minus the cosine value between vectors formed by the frequency values of each  “year”. For instance on the example above, every diagonal cells score 0 because profiles are perfectly aligned. The whiter the cell, the most dissimilar two time steps are.

Enter the number of periods you which to detect

The script also automatically computes the partition which optimally divides the time in a given number of periods. The algorithm searches the cut times that optimize the sum of the homogeneities of each sub-block. If set to zero, the number of slices will be automatically computed using a statistical criterion (Tibshirani et al., 2001).

Top items

Restrain the computation to the N most  frequent items from the selected field(s) (a useful feature when field distribution is heterogeneous as dissimilarity measure may then be sensible to noise).

What to do next?

Based on the matrix and the periods detection, you may want to apply these periods using Period Slicer script to define and use Custom periods in your dataset.

Reference

Robert Tibshirani, Guenther Walther and Trevor Hastie, Estimating the number of clusters in a data set via the gap statistic, J.R. Statist. Soc. B (2001), 63, Part 2, pp. 411-423 (online).