Finding non-randomness in time-series

Consider a set of areas. For each area you have a table of recorded crimes, in particular a time-stamp for when each crime was committed. Your job is to find evidence that there is serial criminality, in the sense that some crimes seem to be too close in time to be unrelated. The problem is that even chance would, occasionally, produce concentrations of events in time. So what kind of strategy would be reasonable for finding evidence of serial criminality?

First one could note that there are different targets to aim for with the analysis. One target could be to classify every individual crime into one of two classes: linked to a prior event, or not linked.

The problem of false positives is a major one here.

Another target could be to estimate the proportion of events in each of these two categories.

Counter to my intuition, it seems the latter is possible without doing the former, and I will try to explain the method I have developed for this analysis.

1. For each area, record the total number of crimes in this area during the period.
2. Simulate a temporal distribution of crimes over days, for that number of crimes and a period of that length.
3. Create a data set with all the temporal distances from the simulation (not only the distance between two events is of importance; the sets of distances between three, four, five, ... events are also of interest).
4. If n = the number of events, then for each of the run lengths 1..(n-1) find the shortest temporal distance that is still within the limits within which 95 per cent of the simulated observations lie.
5. Use these thresholds to classify the individual events in the area as either linked to a prior event or not (the first event in a series is not linked, since it was independent when it took place, and later events do not change that).
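A minimal sketch of steps 2-5 in Python, assuming the null model is crimes falling uniformly over the period, and linking events on the pairwise (nearest-neighbour) threshold only; the function names are illustrative, not part of the analysis itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_thresholds(n_events, n_days, n_sims=2_000, alpha=0.05):
    """Steps 2-4: for each run length k = 2..n, find the span below which
    only alpha of simulated k-event windows fall (the 95 % limit)."""
    spans = {k: [] for k in range(2, n_events + 1)}
    for _ in range(n_sims):
        days = np.sort(rng.uniform(0, n_days, n_events))
        for k in range(2, n_events + 1):
            # shortest window covering k consecutive events
            spans[k].append(np.min(days[k - 1:] - days[:n_events - k + 1]))
    return {k: np.quantile(v, alpha) for k, v in spans.items()}

def classify_events(event_days, thresholds):
    """Step 5 (simplified to run length 2): an event is linked if its gap
    to the previous event is below the pairwise threshold; the first
    event of the area is never linked."""
    days = np.sort(np.asarray(event_days, dtype=float))
    gaps = np.diff(days)
    return np.concatenate(([False], gaps < thresholds[2]))
```

For example, with a pairwise threshold of 1 day, events on days 0, 0.5, 10 and 20 would be classified as not linked, linked, not linked, not linked.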

Now, we have done what I said would not be necessary, but this is only a temporary move, and we should not trust these intermediate classifications per se.

6. Calculate the proportion of linked and not-linked events.
7. Subtract the expected 5 per cent from this proportion.
8. The remaining proportion is what we are looking for: the proportion of unexplained non-randomness.
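Steps 6-8 reduce to a subtraction; a sketch (the function name is mine):

```python
def excess_linked_proportion(linked_flags, alpha=0.05):
    """Steps 6-8: proportion of events classified as linked, minus the
    proportion (alpha) expected to be flagged by chance alone."""
    p_linked = sum(linked_flags) / len(linked_flags)
    # chance alone flags about alpha of the events, so remove that share
    return max(p_linked - alpha, 0.0)
```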

If we want individual classification, what can be done to solve the problem of false positives?

For each time series (the distance between two or more points), record its p-value. (If all areas had had the same number of crimes this would not be necessary, but when they do not, the raw temporal distances in one area are not comparable with the raw temporal distances in another area.)
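One way to get such a p-value is empirically, from the same kind of simulation as before; a sketch under the uniform-null assumption (the name `span_p_value` is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def span_p_value(observed_span, k, n_events, n_days, n_sims=2_000):
    """Empirical p-value: how often chance alone produces a k-event
    window at least as tight as the observed one, given this area's
    crime count and period. This makes areas with different crime
    counts comparable."""
    hits = 0
    for _ in range(n_sims):
        days = np.sort(rng.uniform(0, n_days, n_events))
        min_span = np.min(days[k - 1:] - days[:n_events - k + 1])
        if min_span <= observed_span:
            hits += 1
    return hits / n_sims
```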

Plot the cumulative frequency of the sorted observed p-values and of the sorted simulated p-values (normalised, to control for the number of areas, since the simulations will contain many more areas than the observed data set).

Distances (time series) at which the observed curve has a significantly higher derivative than the corresponding point on the simulated curve are classified as linked.
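A rough way to operationalise the curve comparison is to flag the observed p-values where the observed cumulative frequency runs ahead of the simulated one by more than some margin; this is an approximation of the derivative comparison, not the exact procedure, and the names and the `margin` parameter are illustrative:

```python
import numpy as np

def flag_linked(observed_p, simulated_p, margin=0.0):
    """Flag observed p-values where the observed cumulative frequency
    exceeds the simulated (expected) one: an excess of small p-values
    relative to chance."""
    sim = np.sort(np.asarray(simulated_p, dtype=float))
    obs = np.sort(np.asarray(observed_p, dtype=float))
    # fraction of simulated p-values at or below each observed p-value
    expected_frac = np.searchsorted(sim, obs, side="right") / len(sim)
    # cumulative frequency of the observed p-values themselves
    observed_frac = np.arange(1, len(obs) + 1) / len(obs)
    return observed_frac - expected_frac > margin
```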

Does the process over-estimate the proportion by counting all the distances that a single event is part of? No, the not-linked events are also counted more than once.



Last modified: October 17, 2019