This blog post was written by Vincent Weafer.
Big data introduces new wrinkles for managing data volume, workloads, and tools. Securing increasingly large amounts of data begins with a good governance model across the information life cycle. From there, you may need specific controls to address various vulnerabilities. Here is a set of questions to help ensure that you have everything covered.
1. What is your high-risk and high-value data?
Data classification is labor intensive, but you have to do it. It just makes sense: The most valuable or sensitive data requires the highest levels of security. Line-of-business teams have to collaborate with legal and security personnel to get this right. A well-defined classification system should be paired with determination of data stewardship. If everybody owns the data, nobody is really accountable for its care and appropriate use, and it will be more difficult to apply information lifecycle policies.
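One way to make a classification scheme enforceable rather than aspirational is to encode it, so that each label maps to a concrete set of required controls. The labels and control names below are purely illustrative, not from the original post; a minimal sketch might look like:

```python
from enum import Enum

class Classification(Enum):
    """Hypothetical sensitivity tiers; use your organization's own labels."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

# Illustrative mapping from label to required controls. Each tier
# includes everything the tier below it requires, plus more.
REQUIRED_CONTROLS = {
    Classification.PUBLIC: set(),
    Classification.INTERNAL: {"access-logging"},
    Classification.CONFIDENTIAL: {"access-logging", "encryption-at-rest"},
    Classification.RESTRICTED: {"access-logging", "encryption-at-rest",
                                "named-data-steward"},
}
```

Making the mapping explicit like this also supports the stewardship point above: "named-data-steward" becomes a checkable requirement instead of a vague aspiration.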
2. What is your policy for data retention and deletion?
Every company needs clear directions on which data is kept, and for how long. Like any good policy, it needs to be clear – so everyone can follow it. And it needs to be enforced – so they will.
More data means more opportunity, but it can also mean more risk. The first step to reducing that risk is to get rid of what you don’t need. This is a classic tenet of information lifecycle management. If data doesn’t have a purpose, it’s a liability. One idea for reducing that liability with regard to privacy is to apply de-identification techniques before storing data. That way you can still look for trends, but the data can’t be linked to any individual. De-identification might not suit every business need, but it can be a useful approach to have in your toolbox.
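As a rough illustration of the idea, a de-identification step might drop direct identifiers and replace the record key with a salted hash, so records can still be grouped for trend analysis but not traced back to an individual without the salt. The field names and salt handling here are assumptions for the sketch, not a production design:

```python
import hashlib

# Hypothetical direct-identifier fields; adapt to your own schema.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

# In a real system the salt is a secret kept separate from the data store.
SALT = b"replace-with-a-secret-salt"

def de_identify(record: dict) -> dict:
    """Remove direct identifiers and pseudonymize the user ID with a
    salted SHA-256 hash before the record is stored."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "user_id" in clean:
        digest = hashlib.sha256(SALT + str(clean["user_id"]).encode()).hexdigest()
        clean["user_id"] = digest[:16]  # truncated hash as a stable pseudonym
    return clean
```

Note that salted hashing is pseudonymization, not full anonymization: combined with other fields, records can sometimes still be re-identified, which is why this is one tool in the toolbox rather than a complete answer.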
3. How do you track who accesses which data?
Tracking the data, and who has access to it, is a foundational element of security. As your analytics programs become more successful, you are likely to be exposed to more sensitive data, so tools and storage mechanisms should have that tracking capability built in from the beginning. After all, if you don’t have the right tracking tools in place at the outset, it’s hard to add them after the fact.
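Building tracking in from the beginning can be as simple as wrapping every data-access path so that each read is logged before it happens. The decorator and field names below are a hedged sketch of the pattern, not a reference implementation:

```python
import json
import logging
from datetime import datetime, timezone
from functools import wraps

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data-access-audit")

def audited(classification: str):
    """Decorator that records who read which dataset, and when,
    before the wrapped data-access function runs."""
    def decorator(func):
        @wraps(func)
        def wrapper(user: str, dataset: str, *args, **kwargs):
            audit_log.info(json.dumps({
                "ts": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "dataset": dataset,
                "classification": classification,
            }))
            return func(user, dataset, *args, **kwargs)
        return wrapper
    return decorator

@audited("high")
def read_dataset(user: str, dataset: str) -> str:
    # Stand-in for the real query against your data store.
    return f"rows from {dataset}"
```

Because the logging lives in the wrapper rather than in each query, new data-access functions get tracking by default, which is the "built in from the beginning" property the question is asking about.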
4. Are users creating copies of your corporate data?
Of course they are. Data tends to be copied. A department might want a local copy of a database for faster analysis. A single user might decide to put some data in an Excel spreadsheet, and so on.
So the next question to ask yourself is this: what is the governance model for this process, and how are policies for control passed through to the new copy and the maintainer of this resource? Articulating a clear answer for your company will help prevent sensitive data from leaking out by gradually passing into less secure repositories.
5. What types of encryption and data integrity mechanisms are required?
Beyond the technical issues of cryptographic strength, hashing, salting, and so on, here are some sometimes-overlooked questions to address:
- Is your encryption setup truly end-to-end, or is there a window of vulnerability between data capture and encryption, or at the point when data is decrypted for analysis? A number of famous data breaches have occurred when hackers grabbed data at the point of capture.
- Does your encryption method work seamlessly across all databases in your environment?
- Do you store and manage your encryption keys securely, and who has access to those keys?
Encryption protects data from theft, but doesn’t guarantee its integrity. Separate data integrity mechanisms are required for some use cases, and become increasingly important as data volumes grow and more data sources are incorporated. For example, to mitigate the risk of data poisoning or pollution, a company can implement automatic checks flagging incoming data that doesn’t match the expected volume, file size or pattern.
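The automatic checks described above can start very simply: compare each incoming batch against an expected band for row count and file size, and flag anything outside it for review. The thresholds and field names here are illustrative assumptions; in practice they would be tuned per data source:

```python
# Expected bands per metric for one hypothetical data feed.
# Anything outside the band is flagged for human review before ingestion.
EXPECTED = {
    "rows": (90_000, 110_000),        # expected row count per batch
    "bytes": (40_000_000, 60_000_000) # expected file size per batch
}

def check_batch(rows: int, size_bytes: int) -> list[str]:
    """Return a list of anomaly descriptions; an empty list means the
    batch matches the expected volume and size."""
    problems = []
    lo, hi = EXPECTED["rows"]
    if not lo <= rows <= hi:
        problems.append(f"row count {rows} outside [{lo}, {hi}]")
    lo, hi = EXPECTED["bytes"]
    if not lo <= size_bytes <= hi:
        problems.append(f"file size {size_bytes} outside [{lo}, {hi}]")
    return problems
```

A check this coarse will not catch subtle poisoning, but it cheaply catches the obvious cases — a feed that suddenly doubles, shrinks to nothing, or changes shape — which is exactly the volume/size/pattern mismatch the paragraph describes.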
6. If your algorithms or data analysis methods are proprietary, how do you protect them?
Protecting proprietary discoveries? That’s old hat. What’s easier to miss is the way you arrive at those discoveries. In a competitive industry, a killer algorithm can be a valuable piece of intellectual property.
The data and systems get most of the glory, but analysis methods may deserve just as much protection, with both technical and legal safeguards. Have you vetted and published a plan for securely handling this type of information?
7. How do you validate the security posture of all physical and virtual nodes in your analysis computing cluster?
Big-data analysis often relies on the power of distributed computing. A rogue or infected node can cause your cluster to spring a data leak. Hardware-based controls deserve consideration.
8. Are you working with data generated by Internet of Things sensors?
The key with IoT is to ensure that data is consistently secured from the edge to the data center, with a particular eye on privacy-related data. IoT sensors may present their own security challenges. Are all gateways or other edge devices adequately protected? Industrial devices can be difficult to patch and may have a less mature vulnerability management process.
9. What role does the cloud play in your analytics program?
You’ll want to review the contractual obligations and internal policies of anyone hosting your data or processing it on your behalf. It’s important to know which physical locations they will use, and whether all those facilities have consistent physical (not just logical) security controls. And of course, the geographic locations may affect your regulatory compliance programs.
10. Which individuals in your IT organization are developing security skills and knowledge specific to your big-data tool set?
Over time, your project list, data sets, and toolbox are likely to grow. The more in-house knowledge you develop, the better your own security questions will be.
Read the original post on Dark Reading.