David Marx bio photo

David Marx

Data obsessed

Twitter LinkedIn Github Stackoverflow CrossValidated

Background

This blog post discusses my most recent research in a series of experiments that I hope to publish academically soon. A good place to get some context on what I’m talking about is the first article in the series: Identifying Turmoil in Social Networks With Graph Anomaly Detection. In particular, you should check out the section Methodological details for graph anomaly detection, as the discussion below presupposes familiarity with the procedure described there.

Procedure

Last week, I described a bipartite projection that produces a graph model whose edges can be interpreted as binomial probabilities. These probabilities are represented by a number of successes (in the particular example from last week, the count of mutual users) and a number of total trials (the total number of active users for a particular subreddit). By performing this projection over a series of time slices of a dynamic graph (i.e. a graph with a time component) we get a series of binomials associated with each edge over time. This lends itself very nicely to the problem of anomaly detection: in the case of the social network I am studying (reddit), I am particularly interested in unusual spikes of activity across the graph, and these “spikes” have a very straight-forward interpretation at the edge level when we apply this binomial framework.

For each edge in the “current” time slice of the graph we are interested in, we infer the expected activity by summing up the successes and trials associated with that edge over a window of time slices of the recent past. This gives us a “baseline” proportion. We can then identify edges that exhibit statistically significant increases in activity by invoking a simple test for difference in proportions. We then filter each time to statistically significant edges, forming a graph that specifically models significant increases in activity.

Results

Qualitative performance

I’ve found that performing anomaly detection procedures on the significance graphs has similar or better results (qualitatively), which supports the the theory that these edges contain the bulk of the information relevant to identifying anomalous community behaviour. Let’s compare the results of applying the anomaly detection procedure using a edge count for the local scan statistic on order-1 neighborhoods (i.e. k=1) normalized against the previous two time periods (tau=2) on the reddit public comments dataset from January 2008 through November 2016.

Because the graphs in the filtered dataset are much smaller we can’t compare the values of the scan statistics directly, but if we normalize the scan statistics relative to the max value we observed (after removing the January 2015 observation, which was a tremendous outlier in both datasets that I need to investigate) we see that the scan statistics performed very similarly. Despite the differences in the two data sets, the procedure identified the same anomaly locus in 22 periods. For the most part, the statistics agree about the most turbulent periods in the graph.

Let’s hone in on the results from the last two years of data and compare the results side by side. In the following table, “nexus anomaly” denotes the node in the graph (i.e. the subreddit) that had the highest value for the local scan statistic in that period.

date Anomaly Nexus (filtered) Anomaly Nexus (unfiltered) scan statistic (filtered) scan statistic (unfiltered) Normalized scan statistic (filtered) Normalized scan statistic (unfiltered) Does Anomaly Nexus Match?
2015/02 betterCallSaul FeelsLikeTheFirstTime 797 46823 15% 16%  
2015/03 projecteternity dnbhl 332 36462 6% 13%  
2015/04 thebutton thebutton 5285 290220 100% 100% X
2015/05 Boxing UnitedColors 527 40516 10% 14%  
2015/06 fo4 freegold 689 89904 13% 31%  
2015/07 Blackout2015 Blackout2015 2386 193179 45% 67% X
2015/08 MensLib MensLib 299 52802 6% 18% X
2015/09 apple Amd 378 77305 7% 27%  
2015/10 baseball Nexus6P 1603 100583 30% 35%  
2015/11 JessicaJones GGFreeForAll 197 42558 4% 15%  
2015/12 MakingaMurderer MakingaMurderer 627 90627 12% 31% X
2016/01 stevenuniverse cssfuckfest 668 65758 13% 23%  
2016/02 thewalkingdead CapitalismVSocialism 1679 158323 32% 55%  
2016/03 Defenders ZombieSurvivalRP 960 81655 18% 28%  
2016/04 sweden AskBernieSupporters 1100 128069 21% 44%  
2016/05 Battlefield Mr_Trump 1471 162830 28% 56%  
2016/06 uncensorednews uncensorednews 2825 161728 53% 56% X
2016/07 TheSilphRoad Incels 1258 103127 24% 36%  
2016/08 olympics AskThe_Donald 4378 161663 83% 56%  
2016/09 southpark CivilizationSim 1112 94475 21% 33%  
2016/10 westworld NintendoSwitch 2494 132032 47% 45%  
2016/11 the_meltdown shittyrainbow6 2932 141939 55% 49%  

Again, we can see that the two approaches achieved similar results, especially in many of the most tumultuous periods. Concretely, the highlighted periods/subreddits that the filtered and unfiltered procedures agree upon are associated with the following events:

Period Anomaly Nexus Event likely associated with the observed anomaly
April 2015 /r/thebutton Reddit’s 2015 April fools “prank” spawns a sprawling micro-community
July 2015 /r/Blackout2015 The firing of a Reddit employee triggers site-wide unrest in the form of protests and demonstrations of various kinds
August 2015 /r/MensLib Not gonna lie: I consider myself something of a Reddit historian at this point, but I have no idea what was significant about this community August 2015
December 2015 /r/MakingAMurderer The Netflix series Making A Murder premiers to significant acclaim
June 2016 /r/uncensorednews The worst mass shooting in US history occurs at the Pulse Nightclub in Orlando

There were, however, some important places where they disagree. The significance filtered approach seems to generally have had “better” hits:

Period Anomaly Nexus Event likely associated with the observed anomaly
February 2015 /r/betterCallSaul Better Call Saul, the much awaited spin-off of Breaking Bad, is released
May 2015 /r/Boxing The Floyd Mayweather vs. Manny Pacquiao fight is the most anticipated boxing match in recent memory
October 2015 /r/Baseball The 2015 World Series occurs
November 2015 /r/JessicaJones The Netflix original TV series Jessica Jones is released to significant acclaim
April 2016 /r/Sweden Sweden’s housing minister resigns after comparing Israel to Nazi Germany
May 2016 /r/Battlefield The upcoming release of the video game Battlefield 1 is announced
July 2016 /r/TheSilphRoad The augmented reality game Pokemon GO! is released and quickly becomes the most used mobile app in the world
August 2016 /r/Olympics The Summer Olympics occurs
September 2016 /r/Southpark Season 20 of South Park commences
October 2016 /r/westworld The HBO series WestWorld debuts to significant acclaim
November 2016 /r/the_meltdown Donald Trump defies predictions and wins the US Presidential election

There are a few cases where it looks like the unfiltered procedure had hits worthy of note not caught by the filtered procedure

Period Anomaly Nexus Event likely associated with the observed anomaly
February 2016 /r/CapitalismVSocialism The ongoing US presidential election
April 2016 /r/AskBernieSupporters The ongoing US presidential election
May 2016 /r/Mr_Trump The ongoing US presidential election: in-fighting occurs in the larger community of Reddit’s Trump supporters
August 2016 /r/AskThe_Donald The ongoing US presidential election

In all of these cases, this disagreement does not appear to suggest any fault or weakness on the part of the filtered version of the procedure: in all four of those periods, the filtered procedure honed in on a subreddit that was the focal point for a significant event in the community. There isn’t necessarily a single source of community tumult that is potentially of interest to us in a community as large, wide, and diverse as Reddit, so small variations in anomaly detection procedure can produce different but equally interesting results. I hope to share some of the results of additional modifications to the anomaly detection procedure in the near future (such as expanding the neighborhood beyond k=1 and using different local scan statistics).

Data compression

The “signficance graphs” resulting from this edge threshholding are orders of magnitude smaller – by edge count and positive-degree nodes – than their unfiltered counterparts.

Additionally, the ratio of vertices to edges is roughly constant (approximately 0.245) in the filtered graph, whereas this ratio approaches zero as the graph grows in the unfiltered graph.

In other words, edge count grows linearly with node count in the significance graphs, but edge count grows much faster than node count in the unfiltered graphs.

Performance gains

Furthermore, because the filtered graphs are significantly smaller, the anomaly procedure is accomplished in seconds where it previously took several minutes, for first order local scan neighborhoods. Here are some performance comparisons between the significance filtered and unfiltered datasets (2007-2016, monthly graphs). All experiments performed on a conventional laptop (and for all experiments, tau=2):

Procedure Neighborhood order (k) Time
Unfiltered 1 3.5 minutes (215 seconds)
Unfiltered 2 Stopped after an hour
Unfiltered 3 Not attempted
Filtered 1 40 seconds
Filtered 2 44 seconds
Filtered 3 46 seconds

Conclusion

The application of binomial significance edge threshholding to the dynamic bipartite projection of a user-community social network results in a significant distillation of the information relevant to the anomaly detection task. The resulting sequence of graphs are orders of magnitude smaller, and the relationship between node count and edge count stabilizes to a linear relationship. This suggests that as node count increases, the relative performance gains from applying this procedure increase rapidly. This increased performance does not come at any noticeable cost to the quality of our results: in fact, it appears to improve the quality, which makes sense given that the justification for the thresholding we are performing is closely tied to the particular kinds of events we are hoping to detect.

Additionally, significantly reducing the edge count gives us much more flexibility in our ability to experiment with modifications to the anomaly detection procedure, as running local scan statistics on higher order neighborhoods (or even just playing with the particular local scan statistic we are using) becomes significantly more tractable on conventional hardware. A more complex anomaly detection procedure doesn’t necessarily ensure better results, but at least we now have the opportunity to experiment.