Some small problems I see with the approach described in the #SigmaDiff paper:
* Adjacency matrices, even when stored in sparse structures, are huge for real-world functions.
* Calculating an inter-procedural data dependency graph for real-world binaries takes ages.
* I don't understand why they are using a symbolic analyser (I might need to re-read the paper).
* The Cisco Talos Datasets 1-2 only contain Linux binaries built with GCC or Clang.
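
For the adjacency matrix point, here is a back-of-the-envelope sketch of how memory grows with graph size. Everything in it is an assumption made for illustration: the node counts, the average out-degree of 3, one byte per matrix entry, 4-byte indices, and a CSR layout for the sparse case. None of these figures come from the paper, and the helper names `dense_bytes` and `csr_bytes` are made up.

```python
def dense_bytes(nodes: int, bytes_per_entry: int = 1) -> int:
    """Memory for a dense nodes x nodes adjacency matrix."""
    return nodes * nodes * bytes_per_entry


def csr_bytes(nodes: int, edges: int, index_bytes: int = 4, value_bytes: int = 1) -> int:
    """Memory for a CSR adjacency matrix: row pointers + column indices + values."""
    return (nodes + 1) * index_bytes + edges * (index_bytes + value_bytes)


if __name__ == "__main__":
    # Hypothetical graph sizes; real node and edge counts would have to be measured.
    for nodes in (1_000, 10_000, 100_000):
        edges = nodes * 3  # assumed average out-degree of 3
        print(f"{nodes:>7} nodes: dense={dense_bytes(nodes):>14,} B  "
              f"csr={csr_bytes(nodes, edges):>10,} B")
```

The dense size grows with the square of the node count, while the CSR size grows with the number of edges, so the real cost depends on how dense the graphs actually are. To check this against a real binary, swap in measured node and edge counts.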