I’m looking for a way to take a dataset, extract some key statistical descriptors (means, variances, covariances and such) and use them to generate a new dataset that can be as big as I want while keeping total anonymity. This is not easily done, and it’s not ready yet!
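To make the idea concrete, here is a very naive sketch of it, not the finished method. It assumes that every column of interest is numeric and that a multivariate Gaussian is a good enough model, which is often false for real data. The function name `sample_like` and the toy columns are made up for illustration. Note that matching means and covariances does not by itself guarantee anonymity.

```python
import numpy as np
import pandas as pd


def sample_like(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Draw n_rows synthetic rows from a Gaussian fitted to df's numeric columns."""
    numeric = df.select_dtypes(include="number")
    mean = numeric.mean().to_numpy()   # one mean per column
    cov = numeric.cov().to_numpy()     # variances on the diagonal, covariances elsewhere
    rng = np.random.default_rng(seed)  # seeded so the result is reproducible
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=numeric.columns)


# Hypothetical toy data, only to show the call
original = pd.DataFrame({
    "age": [23, 35, 41, 52, 67],
    "income": [21000, 34000, 40000, 58000, 45000],
})
synthetic = sample_like(original, n_rows=1000)
print(synthetic.describe())
```

The new dataset can have any number of rows, but it only preserves first and second moments: skewed distributions, categorical columns and non-linear relationships are lost.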
A cool concept I discovered while looking into dataset masking is k-anonymity. While it is not directly involved in what I’m trying to do, it’s still really interesting to understand.
When it comes to identifying someone from a dataset, there are hard identifiers, like email, full name, or anything that can directly point to a specific person; and there are soft identifiers, useless on their own, but able to single out an individual when taken as a group.
When trying to make a dataset anonymous, hard identifiers are usually simply removed (or replaced entirely), while soft identifiers can be given as ranges instead of precise values, such as using an age range instead of the real age.
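As a small illustration of this masking step, the sketch below drops a hard identifier and replaces exact ages with 10-year bands using `pandas.cut`. The column names, the sample rows and the chosen bands are all hypothetical.

```python
import pandas as pd

# Hypothetical records: "email" is a hard identifier, "age" a soft one
people = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "e@example.com"],
    "age": [23, 27, 34, 38, 45],
})

# Hard identifiers are simply removed
masked = people.drop(columns=["email"])

# Soft identifiers are generalized: [20, 30), [30, 40), [40, 50)
bins = [20, 30, 40, 50]
labels = ["20-29", "30-39", "40-49"]
masked["age"] = pd.cut(masked["age"], bins=bins, labels=labels, right=False)

print(masked)
```

Ages outside the listed bins would become missing values, so the bins have to cover the whole range present in the data.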
This is not always enough, depending on the dataset and on the work done while anonymizing it. The k-anonymity value indicates how strong the anonymization is: it is the minimum number of times that each combination of soft identifier values appears in the dataset.
If even one combination appears only once, that person can be singled out by being the only one with that combination of anonymized soft identifiers.
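The definition translates almost directly into pandas: group the rows by the soft identifier columns and take the size of the smallest group. This is a minimal sketch. The helper name `k_anonymity` and the sample columns are assumptions made for the example.

```python
import pandas as pd


def k_anonymity(df: pd.DataFrame, soft_identifiers: list) -> int:
    """Return the size of the smallest group of rows sharing the same soft identifier values."""
    # observed=True ignores unused category combinations;
    # dropna=False counts rows with missing values as their own group
    group_sizes = df.groupby(soft_identifiers, observed=True, dropna=False).size()
    return int(group_sizes.min())


# Hypothetical, already generalized records
records = pd.DataFrame({
    "age_range": ["20-29", "20-29", "30-39", "30-39", "30-39"],
    "zip_prefix": ["100", "100", "100", "100", "200"],
    "diagnosis": ["flu", "cold", "flu", "asthma", "cold"],
})

k = k_anonymity(records, ["age_range", "zip_prefix"])
print(k)
```

Here the combination ("30-39", "200") appears only once, so the dataset is only 1-anonymous and that row can be singled out. Using `age_range` alone as the soft identifier gives a smallest group of two rows.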
The first step in retrieving information from the dataset that can be used to replicate it is getting the column information. The pandas library is extremely useful for this.
I also just found out about infer_objects, a method of DataFrame. It attempts to infer a more specific type for each column that is stored with the generic object dtype. This can work better than the naive assignment that pandas gives.
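A short sketch of what this looks like in practice, using a hypothetical frame whose columns were all created with the generic `object` dtype:

```python
import pandas as pd

# Hypothetical frame: every column is forced to dtype=object on purpose
df = pd.DataFrame(
    {"id": [1, 2, 3], "score": [0.5, 1.5, 2.5], "name": ["a", "b", "c"]},
    dtype=object,
)
before = df.dtypes

# infer_objects returns a new frame; the original is left untouched
inferred = df.infer_objects()

print(before)
print(inferred.dtypes)
```

The conversion is a soft one: it looks at the Python objects already stored in the column, so numbers stored as strings (for example `"1"`) stay as they are. Those need an explicit conversion such as `pandas.to_numeric`.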
There is a fantastic Python library with this name, but I will talk about it some other time. This time I’ll explain why the rock lime is amazing too.
Lime is calcium hydroxide, Ca(OH)2. It’s an alkaline compound, and it was already used as a cement by the Romans thousands of years ago.
The way it’s produced and the way it works are what made it really interesting for me, since its life is a cycle: first limestone (CaCO3) is taken and burnt at high temperatures. This releases a molecule of carbon dioxide, CO2, leaving CaO, a strongly reactive and alkaline compound that reacts with water, bonding a water molecule and finally becoming Ca(OH)2.
The last part is where lime is finally used (usually mixed with sand or some other inert ingredient) to cement something together, like bricks. Once in position, the lime loses its water molecule but regains a carbon dioxide molecule, which is present in watery solution as carbonic acid. This turns it back into limestone.
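Written as overall reactions, the whole cycle described above is:

```latex
\begin{align*}
\mathrm{CaCO_3} &\xrightarrow{\text{heat}} \mathrm{CaO} + \mathrm{CO_2}
  && \text{(burning the limestone)} \\
\mathrm{CaO} + \mathrm{H_2O} &\longrightarrow \mathrm{Ca(OH)_2}
  && \text{(adding water)} \\
\mathrm{Ca(OH)_2} + \mathrm{CO_2} &\longrightarrow \mathrm{CaCO_3} + \mathrm{H_2O}
  && \text{(setting, back to limestone)}
\end{align*}
```

Adding the three lines together cancels everything out, which is exactly why it is a cycle.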