Building Virtual Crime Scenes for the eDiscovery World
Back in November, we applied for funding through a BAA grant entitled ADAMS (Anomaly Detection At Massive Scales). We should find out whether we won any funding sometime this month. In the meantime, Fast Company found one of my partners and, through him, me. The article stemming from those interviews can be found here. It’s worth a read.
Take a moment and do some research on the ADAMS problem. If you’ve any experience with ediscovery, or complex computer forensics cases, you might begin to think that you’ve seen this problem before on a smaller scale. Note that the ADAMS announcement specifies that the providers must provide test data – the providers need to prove that their products work in a controlled, instrumented environment before they’re released into the wild. Further, the people running the project must see the results before the solutions are accepted.
Hmm. What if we could do the same for ediscovery? What if you could have three vendors on site and compare them, on known data, head to head?
And what if you could run known data through an ediscovery tool or process and accurately measure that process? What if, in so doing, you found that the process was flawed? What if it were your process? Your vendor’s process? Your opponent’s process?
Oddly enough, we’re developing tools to help you answer some of those “What if”s.
In the course of the interview, I came up with an analogy for our process which the reporter captured quite well – we’re creating virtual crime scenes. Crime scenes that can be adjusted, wiped clean, rebuilt, or used over and over again. Further, we’re populating these entirely electronic crime scenes with real evidence – documents with accurate metadata, email messages with legitimate headers, SMS messages with topical content.
To digress a bit, the last item is the most difficult, and the most interesting. It is easy to sanitize existing content, and fairly easy to generate responsive content wrapped in digital noise, but can we create a reasonable approximation of human-generated content and keep it on topic? Can we create, out of whole cloth, email conversations that appear to discuss a particular business topic in a manner that ensures they will be, or will not be, responsive to particular criteria?
No, not immediately, but we’re on the right path. And please don’t get too distracted by our desire to include natural language processing at some point; there is an enormous amount of value we can add now and in the near future.
We already build virtual crime scenes: digital corpora that represent the corporate computing environment and are meant to be processed by ediscovery tools. And knowing how a corpus was built, down to the last byte, allows us to determine the accuracy of the ediscovery process, down to the last byte.
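To make that idea concrete, here is a minimal sketch in Python. It is an illustration only, not our actual tooling. It assumes a seeded corpus where every item carries a known responsive/non-responsive label, and it assumes the ediscovery tool under test can hand back a hash of each item it produced. The names `SeededItem`, `build_manifest`, and `score` are hypothetical, as is the sample content.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SeededItem:
    """One item planted in the corpus, with its ground-truth label (hypothetical type)."""
    path: str
    content: bytes
    responsive: bool


def digest(content: bytes) -> str:
    """SHA-256 of the exact bytes; any altered byte yields a different digest."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(items):
    """Map content hash -> known responsiveness. Assumes item contents are distinct."""
    return {digest(item.content): item.responsive for item in items}


def score(manifest, produced_hashes):
    """Compare what a tool produced against the known corpus."""
    produced = set(produced_hashes)
    responsive = {h for h, is_resp in manifest.items() if is_resp}
    true_pos = produced & responsive
    # Produced items whose bytes match nothing we seeded (altered or unexpected).
    unknown = produced - manifest.keys()
    return {
        # Ratios fall back to 0.0 when the denominator is empty.
        "precision": len(true_pos) / len(produced) if produced else 0.0,
        "recall": len(true_pos) / len(responsive) if responsive else 0.0,
        "missed": len(responsive - produced),
        "altered_or_unknown": len(unknown),
    }


# Made-up example corpus: two responsive items, two non-responsive ones.
corpus = [
    SeededItem("mail/0001.eml", b"Subject: Project Falcon pricing\n\nDraft attached.", True),
    SeededItem("mail/0002.eml", b"Subject: Lunch\n\nNoon?", False),
    SeededItem("docs/falcon_terms.txt", b"Falcon contract terms v3", True),
    SeededItem("docs/menu.txt", b"Cafeteria menu", False),
]
manifest = build_manifest(corpus)

# Pretend tool output: one correct hit, one false hit, and one item
# whose bytes were changed during processing (a trailing newline added).
produced = [
    digest(corpus[0].content),
    digest(corpus[1].content),
    digest(corpus[2].content + b"\n"),
]
report = score(manifest, produced)
print(report)
```

Hashing the exact bytes is a deliberate choice for this sketch: a tool that silently changes a document during processing shows up as an unknown item rather than a match. A real comparison would likely also need to check metadata and to handle duplicate content, both of which this sketch ignores.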
Stay tuned: interesting times are coming for the ediscovery world.