CabAnon: progress update and some code
Written by Vincent Toubiana
18 October 2017
Previously, we announced that we were starting the CabAnon project in order to evaluate when and where anonymized data are useful. Today, we explain how we anonymized the dataset and we share the script used so far. We will be happy to receive your feedback and any information on use cases that you may build with these data.
Getting the files
First things first, you have to get the right files. The NYC taxi trip dataset was initially released after Chris WONG sent a FOIA request to access the data. However, this is not how we got the data. Indeed, that dataset contained a field that could easily be linked back to identifiers: the hashed medallion. Furthermore, we had to make sure that we would use a clean dataset and that, if anyone had asked to have their trips removed from the dataset, our experiment would not include these trips (well… we have to pay attention to the right to object, right?).
Fortunately, the New York City Taxi and Limousine Commission shared the cleaned data (link to the command line to get the data).
As we wanted to replace accurate location points with ZIP codes, we had to get a map of the ZIP codes. More precisely, we needed shapefiles that map coordinates to ZIP codes. These are available on the Census Bureau website.
We also wanted to have another mapping that was recommended by the Taxi and Limousine Commission: the taxi zones. This mapping is available here, and if you have a quick look at it, you will quickly recognize NYC neighborhoods. As far as we understand, these zones are based on taxi pickup densities and are therefore more relevant than ZIP codes.
Mapping the data
We will not write an R tutorial here. We made fairly limited use of R, only to find which taxi zone or ZIP code each point falls into.
Running this script takes a while, which is why we started to use our cluster (provided by terra lab).
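To give an idea of what this mapping step involves, here is a minimal Python sketch of a point-in-polygon lookup. It is not the R script mentioned above: the function names, the `zones` dictionary and the square polygons are hypothetical, and a real run would read the polygons from the shapefiles with a geospatial library instead of hard-coding them.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    starting at (lon, lat) crosses. An odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def locate(lon, lat, zones):
    """Return the name of the first zone containing the point, or None."""
    for name, polygon in zones.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Hypothetical zones: two adjacent unit squares instead of real NYC shapes.
zones = {
    "zone_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "zone_b": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
pickup_zone = locate(0.5, 0.5, zones)
```

Testing every point against every polygon is slow, which is consistent with the running time mentioned above; a spatial index would be the usual way to speed this up.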
After this first process, we ended up with files that were similar to the original ones. The only difference was the granularity of the location, since we had changed accurate locations of pick-ups and drop-offs to ZIP codes (or taxi zones). This is far less accurate, but you could still retrieve someone’s trip in the dataset using the very accurate timestamps. Indeed, you would still be able to retrieve the trips of some celebrities, and even if you would not know exactly where they went, you would get some information on their destination.
While the privacy threat is reduced, the dataset is still not anonymized.
Looking for k-anonymity
We wanted to achieve k-anonymity with respect to both the pick-up and the drop-off locations, so that a pick-up could not be distinguished among at least k others in the same zone and time slot, and the same goes for the drop-offs. To achieve that, we had to reduce the accuracy of the timestamps by rounding them up.
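As an illustration of the rounding step, the following Python sketch rounds a timestamp up to the end of its n-minute slot. The function name `round_up` and the sample dates are hypothetical, and the sketch assumes that the slot length divides a day evenly, which holds for 5, 15, 30 and 60 minutes.

```python
from datetime import datetime, timedelta


def round_up(ts, minutes):
    """Round a timestamp up to the next multiple of `minutes` past midnight.
    A timestamp that already sits on a slot boundary is left unchanged."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    step = timedelta(minutes=minutes)
    # Ceiling division on timedeltas: -(-a // b) rounds the quotient up.
    slots = -((midnight - ts) // step)
    return midnight + slots * step


rounded = round_up(datetime(2013, 1, 1, 10, 7, 30), 15)
```

With this convention, every trip whose timestamp falls in the same n-minute window ends up with the same rounded value, which is what makes trips indistinguishable from each other in time.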
First, for each ZIP code, we counted how many taxis started their trip within the same n-minute slot. We did that for n=[5,15,30,60] and we stored this number for each trip. Similarly, we computed the number of taxis finishing their trips in this area. This was achieved with this request (for the pickup):
For each timestamp granularity, we obtained the number of taxis that had started or ended their trip in a zone. Based on our requirement, we could then choose the timestamp granularity that matched our k.
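The counting step can be sketched in Python as follows. This is not the request we ran; it is an illustration under assumptions: the field names `pickup_zone` and `pickup_time`, the helper names and the sample trips are all hypothetical.

```python
from collections import Counter
from datetime import datetime

SLOT_MINUTES = (5, 15, 30, 60)


def slot_key(ts, minutes):
    """Identify the n-minute slot of a timestamp (rounded up within the day)."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return ts.date(), -(-seconds // (minutes * 60))  # ceiling division


def pickup_counts(trips, minutes):
    """For each trip, count the trips sharing its pickup zone and time slot."""
    keys = [(t["pickup_zone"], slot_key(t["pickup_time"], minutes)) for t in trips]
    counts = Counter(keys)
    return [counts[key] for key in keys]


# Hypothetical sample: three pickups in one ZIP code, one in another.
trips = [
    {"pickup_zone": "10001", "pickup_time": datetime(2013, 1, 1, 8, 1)},
    {"pickup_zone": "10001", "pickup_time": datetime(2013, 1, 1, 8, 4)},
    {"pickup_zone": "10001", "pickup_time": datetime(2013, 1, 1, 8, 12)},
    {"pickup_zone": "10002", "pickup_time": datetime(2013, 1, 1, 8, 2)},
]
counts_by_slot = {n: pickup_counts(trips, n) for n in SLOT_MINUTES}
```

Storing one count per trip and per granularity makes it possible to pick, afterwards, the finest granularity for which a trip still has at least k companions. The same function applied to drop-off fields would give the drop-off counts.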
Estimating the loss
The raw dataset contained 173 176 321 trips. But to achieve k-anonymity, all the trips that started in a low-density area had to be dropped, because too few other trips shared the same area and time slot. Accuracy was decreased and some records were lost in this process.
Here are the numbers of trips kept for k=10 and different time slots:
- Time slot of 30 minutes: 165 373 509 (95.5%)
- Time slot of 15 minutes: 159 162 257 (91.9%)
- Time slot of 5 minutes: 141 099 441 (81.4%)
Unsurprisingly, the shorter the time slot we used, the more trips we had to drop. However, it is interesting to note that the density of taxi trips was such that we did not need to drop many trips.
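The filtering step can be sketched in Python as follows. The field names (`pickup_zone`, `pickup_slot`, `dropoff_zone`, `dropoff_slot`) and the sample trips are hypothetical. One design choice here is an assumption, not something described above: the sketch repeats the filter until no more trips are removed, because dropping a trip for its drop-off can push its pickup group below k.

```python
from collections import Counter


def keep_k_anonymous(trips, k):
    """Keep only trips whose pickup group and drop-off group both contain
    at least k trips. The filter is repeated until the result is stable."""
    kept = list(trips)
    while True:
        pickups = Counter((t["pickup_zone"], t["pickup_slot"]) for t in kept)
        dropoffs = Counter((t["dropoff_zone"], t["dropoff_slot"]) for t in kept)
        filtered = [
            t for t in kept
            if pickups[(t["pickup_zone"], t["pickup_slot"])] >= k
            and dropoffs[(t["dropoff_zone"], t["dropoff_slot"])] >= k
        ]
        if len(filtered) == len(kept):
            return filtered
        kept = filtered


def trip(pickup_zone, dropoff_zone, slot="08:05"):
    """Build a hypothetical trip record with already-rounded time slots."""
    return {"pickup_zone": pickup_zone, "pickup_slot": slot,
            "dropoff_zone": dropoff_zone, "dropoff_slot": slot}


trips = [trip("Z1", "Z2"), trip("Z1", "Z2"), trip("Z1", "Z3"), trip("Z4", "Z2")]
kept = keep_k_anonymous(trips, k=2)
kept_share = 100 * len(kept) / len(trips)
```

In this toy sample, the trip ending alone in Z3 and the trip starting alone in Z4 are dropped, so half of the trips survive with k=2.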
Even if we had kept all the trips, a risk would remain: if, say, all five trips that started in an area ended in the same area, knowing where your trip started would still reveal where it finished. To address this, we would have to achieve l-diversity, that is to say that, for each area, the trips starting there end in at least l different zones.
To do that, we counted the number of remaining trips starting and ending in different zones using this query:
And then, we dropped all the trips that did not meet this property. We used the settings that gave us the finest granularity and applied l-diversity on top of them to find an upper bound of the “loss rate”. Therefore, we kept k=10 and set the time slot to five minutes.
- with l=2, we kept only 79 973 385 trips (46.2%)
- with l=3, we kept 49 956 125 trips (28.8%)
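The l-diversity filter can be sketched in Python under one possible reading of the property: for each pickup zone and time slot, the trips must end in at least l distinct drop-off zones. This is not the query we ran, and the field names and sample trips are hypothetical.

```python
from collections import defaultdict


def keep_l_diverse(trips, l):
    """Keep only trips whose pickup group (zone, slot) leads to at least
    l distinct drop-off zones."""
    destinations = defaultdict(set)
    for t in trips:
        destinations[(t["pickup_zone"], t["pickup_slot"])].add(t["dropoff_zone"])
    return [
        t for t in trips
        if len(destinations[(t["pickup_zone"], t["pickup_slot"])]) >= l
    ]


# Hypothetical sample: pickups in Z1 spread over two destinations,
# pickups in Z2 all end in the same zone.
trips = [
    {"pickup_zone": "Z1", "pickup_slot": "08:05", "dropoff_zone": "Z3"},
    {"pickup_zone": "Z1", "pickup_slot": "08:05", "dropoff_zone": "Z4"},
    {"pickup_zone": "Z2", "pickup_slot": "08:05", "dropoff_zone": "Z3"},
    {"pickup_zone": "Z2", "pickup_slot": "08:05", "dropoff_zone": "Z3"},
]
kept = keep_l_diverse(trips, l=2)
```

Here the two trips starting in Z2 are dropped with l=2: knowing that a trip started in Z2 in that slot would reveal that it ended in Z3.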
While the loss rate was high, we did not have to enforce l-diversity in this case, because we had already dropped many trips to get k-anonymity.
Our anonymization required dropping some data and reducing the granularity. This is not uncommon. However, it should be noted that the dropped trips were quite isolated at a given time of day. Depending on our requirements, the share of discarded trips varied between 5% and 60%. We’ll see in another article that this does not always reduce the usefulness of the data.