Who governs the data?

When it comes to privacy, should we allow secret handshakes to take place? In case you haven't been paying attention to trends in data lately, data governance is a hot topic. As we strive to implement regulations and best practices that keep privacy intact while data is exchanged in our industrial applications, we are engaging in a process of data governance.
no more secret handshakes...
The core idea here is to take governance policies and translate them into requirements for our applications and for our data. A lot of approaches, in my opinion, tend to overcomplicate this idea. The goal of this post is to suggest a straightforward approach that I believe we can all deploy today to better protect our data.

So perhaps you are following what I'm saying and perhaps you want to come up with some data governance policies. How do you know which policies to go after? The answer comes from two sources. First, from the top down, we examine what our regulatory bodies and industry best practices are suggesting. Second, from the bottom up, we can use anomaly detection frameworks to find peculiar examples of use within the access patterns for our data. For brevity, I will come back to this latter idea in a subsequent blog post. From the top down, we engage in an external research process.
can we just be silent....
External policies contain a wealth of knowledge we can build upon to inform the types of policies that our businesses should maintain. These policies often come in the form of explicit guidelines from authoritative sources. For example, HIPAA governs the legal use of personal healthcare information. In addition to authoritative sources, we also have influential sources. These typically come from groups of businesses coming to consensus on best practices for a particular type of marketplace. For example, the automotive industry has data guidelines for the use of automotive data. Finally, each organization typically has its own standards that are either made explicit or are implicitly present in day-to-day operations. To discover these, talk to the experts in your organization. As you research these guidelines, keep track of the types of data they discuss, the access restrictions they recommend, and the scope of use they recommend.
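One way to keep track of what this research turns up is a small catalog of guideline records. The sketch below is only an illustration: the class names, field names, and catalog entries are my own assumptions, and the restriction and scope strings are placeholders to be filled in from the actual source text, not summaries of what any regulation says.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(Enum):
    AUTHORITATIVE = "authoritative"  # explicit guidelines, e.g. a regulation
    INFLUENTIAL = "influential"      # industry consensus on best practices
    INTERNAL = "internal"            # the organization's own standards


@dataclass(frozen=True)
class Guideline:
    source: str                 # where the guideline comes from
    kind: SourceKind            # which of the three source types it is
    data_types: frozenset       # the types of data the guideline discusses
    access_restriction: str     # the access restriction it recommends
    scope_of_use: str           # the scope of use it recommends


def guidelines_for(data_type, catalog):
    """Return every recorded guideline that discusses the given data type."""
    return [g for g in catalog if data_type in g.data_types]


# Hypothetical entries; the placeholder strings are deliberately left unfilled.
catalog = [
    Guideline(
        source="HIPAA",
        kind=SourceKind.AUTHORITATIVE,
        data_types=frozenset({"personal_health_information"}),
        access_restriction="TODO: record from the source text",
        scope_of_use="TODO: record from the source text",
    ),
    Guideline(
        source="Internal data handling standard",
        kind=SourceKind.INTERNAL,
        data_types=frozenset({"personal_health_information", "billing"}),
        access_restriction="TODO: record after talking to in-house experts",
        scope_of_use="TODO: record after talking to in-house experts",
    ),
]

for g in guidelines_for("personal_health_information", catalog):
    print(g.source, g.kind.value)
```

Keeping the three fields separate pays off later, since they line up with the three dimensions of scope: who may access, which data, and for what purpose.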

Once we have the guidelines, we need to figure out how to translate them into procedures we can implement in our data applications. Translation is impossible if we do not have a way to define the concept of scope. Scope relates to user access, to data type, and to use case. I think people are used to understanding the restriction of data by user type and by use case. However, the concept of data type can actually be fairly complicated. Since data can change types based upon the way it has interacted with various other systems, we need to find a way to track the lineage of each data record. Operational metadata (OMD) refers to the tags we place on an atomic datum to record where it has been in the past. This allows us to create a data lineage for each record. The most obvious way this typically manifests is by keeping track of the source application, creation date, and creation location for each record. Once implemented, such lineage linking gives us a finer-grained ability to translate business rules and heuristics into data types.
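To make the OMD idea concrete, here is a minimal sketch of a record that carries its own lineage tags. The names (`LineageEvent`, `Record`, `touch`, `TYPE_RULES`) and the example systems are hypothetical choices for illustration, and the rule that maps visited systems to data types is just one simple heuristic you might start from.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """One OMD tag: which system handled the record, when, and where."""
    system: str
    timestamp: datetime
    location: str


@dataclass
class Record:
    payload: dict
    lineage: list = field(default_factory=list)

    def touch(self, system, location, when=None):
        """Append an OMD tag each time a system handles this record."""
        when = when or datetime.now(timezone.utc)
        self.lineage.append(LineageEvent(system, when, location))
        return self

    @property
    def source(self):
        """The first tag holds the source application, creation date, and location."""
        return self.lineage[0] if self.lineage else None


# Hypothetical business rule: passing through a system adds a data type.
TYPE_RULES = {
    "clinic_intake": "health",
    "billing": "financial",
}


def derived_types(record):
    """Derive the record's current data types from where it has been."""
    return {TYPE_RULES[e.system] for e in record.lineage if e.system in TYPE_RULES}


record = Record({"id": 1}).touch("clinic_intake", "us-east")
print(derived_types(record))   # only the intake system has touched it so far
record.touch("billing", "us-east")
print(sorted(derived_types(record)))
```

The point of the sketch is that the data type is not stored once and forgotten; it is recomputed from the lineage, so a record picks up new restrictions as it moves between systems.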
building data governance is up to the experts....
Now that we have policies that relate to users, use cases, and OMD types, we can start to define the defects in our systems. Defects are occurrences of policy violations. So far I have focused primarily upon explicit policies. These policies only tell us whether some type of access is absolutely allowed or disallowed. In such a system we can test effectiveness by creating fake data access calls and then checking whether we detect defects for these calls. This synthetic transaction approach is very useful for testing the most important types of data access policies. However, in more complex systems such unit testing may not be sufficient, because the number of possible interactions grows combinatorially. In addition, such binary decisions may be too rigid for real-world applications. For example, sometimes privacy can relate to having an infrequent pattern of access. For these reasons the next post will discuss some probabilistic techniques we can use to further evaluate the presence of privacy defects in our applications.
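The synthetic transaction approach can be sketched in a few lines. Everything here is an assumption made for illustration: the roles, use cases, and data types are invented, and the policy is modeled as a plain allow-list over (user role, use case, data type) triples, which is only one way to express an explicit policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCall:
    user_role: str
    use_case: str
    data_type: str


# Hypothetical explicit policy: anything not listed here is disallowed.
ALLOWED = {
    ("clinician", "treatment", "health"),
    ("billing_clerk", "invoicing", "financial"),
}


def is_defect(call):
    """A defect is an access call that violates the explicit policy."""
    return (call.user_role, call.use_case, call.data_type) not in ALLOWED


def run_synthetic_transactions(cases):
    """Replay fake access calls and return those the detector got wrong.

    Each case pairs a call with whether we expect it to be flagged.
    """
    return [call for call, expected in cases if is_defect(call) != expected]


synthetic_cases = [
    (AccessCall("clinician", "treatment", "health"), False),      # allowed
    (AccessCall("billing_clerk", "invoicing", "health"), True),   # wrong data type
    (AccessCall("clinician", "marketing", "health"), True),       # wrong use case
]

misclassified = run_synthetic_transactions(synthetic_cases)
print(f"{len(misclassified)} synthetic transactions misclassified")
```

Even in this toy version, three roles, three use cases, and three data types already give 27 possible triples, which hints at why exhaustive testing stops scaling as the system grows.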

Final words: while privacy is not dead, it is up to us to continue to protect and define it. It is worth mentioning that while this approach promises to control data access, it is not necessarily an approach to building a secure system. While these policies tend to relate to the enforcement of various access criteria, they differ from the type of approach you would use to build security into an application. This goes back to the first debate in the last post, which discussed the similarities and differences between privacy and concepts such as security, anonymity, and ownership.