A Data Anonymization Method Immune to a Hacker A.I.
Data anonymization is a hot topic, and for good reason. With so many privacy breaches and the potential for even worse scenarios, it's no wonder everyone in the data community is aware of this matter. This includes data scientists, since we often need to deal with sensitive data of this sort, usually referred to as PII (Personally Identifiable Information). However, common anonymization methods such as hashing, although useful, don't stand a chance against modern A.I. systems, which can figure out the information we are trying to hide through clever deductions based on the remaining data.
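To see why naive hashing is weak, consider a hypothetical case where a small-domain field (here, a made-up 5-digit ZIP code) is hashed with SHA-256. Because the space of possible values is tiny, an attacker doesn't need to break the hash at all; they can simply hash every candidate and compare. This sketch is illustrative only, not any specific system's pipeline:

```python
import hashlib

def hash_value(value: str) -> str:
    # Naive "anonymization": hash the sensitive field
    return hashlib.sha256(value.encode()).hexdigest()

# An "anonymized" record, as it might appear after hashing a ZIP code
hashed_zip = hash_value("90210")

# Brute-force reversal: enumerate all 100,000 possible 5-digit ZIP codes
recovered = next(
    z for z in (f"{i:05d}" for i in range(100_000))
    if hash_value(z) == hashed_zip
)
print(recovered)  # → 90210
```

A determined attacker (A.I.-assisted or not) can run this sort of dictionary attack in seconds, which is why hashing alone rarely protects low-entropy PII like ZIP codes, birthdates, or phone numbers.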
What if there were a method for anonymizing the data at hand while maintaining the relationships among its variables? This simple question may make the whole process seem straightforward, but it's much more challenging than it appears. Recently, I talked with some A.I. experts who mentioned this matter and were content with a particular Python package they had found, which enables this sort of anonymization while maintaining 65-70% of the information at hand. A couple of weeks back, I also came up with a Julia-based solution to the problem, one that maintains a much larger proportion of the information in the data. So, solutions to the anonymization problem exist, if you know how to deal with the data.
Dealing with the data so that you maintain the bulk of the information while making it anonymous isn't easy. The idea is to create new data that closely resembles the original and use that instead. If this process is done in a stochastic manner (i.e., using randomness in a controlled way), it is impossible to reverse. In other words, no matter how intelligent the hacker, even an A.I. one, going back to the original data is not possible. This is not an issue for the data scientist analyzing the data, since what she works with is the information in that data (aka the signal), which is retained to a large extent, making the anonymized data more or less as valuable as the original.
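One simple way to sketch this idea (not my actual Julia method, just a common illustrative approach) is to estimate the mean and covariance of the original numeric data and then sample brand-new rows from that distribution. The toy dataset below, with a fabricated age/income relationship, exists only for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "sensitive" dataset: two correlated numeric columns (age, income)
age = rng.normal(40, 10, size=1000)
income = 1000 * age + rng.normal(0, 5000, size=1000)
original = np.column_stack([age, income])

# Stochastic synthesis: fit mean and covariance, then sample new rows.
# No synthetic row corresponds to any real individual.
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1000)

# The relationship between the variables (the signal) survives
orig_corr = np.corrcoef(original, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(orig_corr, 2), round(synth_corr, 2))
```

The two correlations come out nearly identical, so a model trained on the synthetic table sees essentially the same signal, yet there is no mapping back from any synthetic row to a real person. Real-world methods handle non-Gaussian and categorical data with more sophisticated machinery, but the principle is the same.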
The best part about this process is that it's applicable everywhere, across all domains where data science is used. In every such project, data is eventually transformed into numbers, so regardless of the domain it comes from, it's possible to secure it in terms of privacy with the aforementioned anonymization process. It also doesn't matter which applications you plan to build with this data, since all of this takes place in the preprocessing stage of the analysis, before the actual modelling part.
Now you have the option to perform anonymization on the data at hand, without having to worry about a hacker A.I. compromising it in terms of privacy. You just need to find a data scientist who is adept at this process (ahem!). So, if you have a proof-of-concept project in mind, you can carry it out even with someone outside your organization, using anonymized data for it. This can open up new possibilities for deriving value from the data at hand, without jeopardizing the privacy of the people it describes. So, perhaps ethical use of data is not such a far-fetched concept after all!
PS – This is the kind of article I would normally publish on my data science and A.I. blog, FoxyDataScience.com. This time I decided to publish it here as it’s easier for real people to comment (blogs tend to get more SEO leeches and other spammers). If you enjoy this article, consider visiting my 100% ad-free blog and checking out my other educational material. Cheers!