TheJach.com

Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)
Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010

Automated anonymous surveying

Jonathan Blow was recently quoted in media as saying: "...piracy rates for PC games are often 85-90 percent. That's true. If 10 percent of people who pirate games would buy the games, that would double profits. Double! That's insane. That's the difference between starving to death and being comfortable enough to make the next game." This bugged me for a few reasons, and this from someone who never pirates games.

First check: does the math make sense? (Skip to the last parenthetical: it sort of does.) If you sell your game for $10 and get 100 customers, you've made $1000. Suppose the piracy rate means that if you count the legit users and count the pirate users (assuming none overlap, I'll get to that), you see around 85-90 pirate users per 100 legit users. In other words, another $850-$900 in missing sales. If just 10 percent of those 85-90 pirates (8.5-9, we'll round to 9) bought the game, sales would increase by $90, bringing the total to $1090. This is nowhere near "double" revenue, but can it be double profit? Maybe I'm misunderstanding what he means by his whole remark -- perhaps he means for his game in particular? But he hasn't made a profit yet, so that seems doubtful. The only way the statement could be true is if the game cost $910 to make. If that is true, then at 100 sales, you've made $90 in profit. And if 10% of the pirate users paid, you've made another $90, doubling your profits. But this doesn't hold over any longer period of time. If after the game has been around for a while you have made 1000 sales total (and there are now 900 pirates), you have made $10,000 in total sales, and a total profit of $9,090. Now assume 10% of those pirates pay, or 90 users: that would net you an additional $900 in profit. This is far short of double profit. So his statement makes no sense mathematically, at least to me.

(Okay, let's try one more time... Let's suppose that a 90% piracy rate means that if there are 100 copies of a game out there, 90 of them are pirated, and only 10 of them are legit individual sales. Look at 1000 copies out there, only 100 legit: total sales is thus $1000, and let's say the game cost $100 to make, so profit is $900. If 90 of the 900 pirates bought, that's an extra $900, so double profit. As you increase the number of copies, or take the cost-to-create to $0, the ratio approaches 1.9 though, not strictly double. I assume this is what was meant.)
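The two readings of the quote can be checked with a few lines of arithmetic. This is only a sketch: the function name `profit_ratio` and its parameters are made up for illustration, and the $10 price and development costs are the example figures from the paragraphs above.

```python
PRICE = 10  # dollars per copy, the example price used above


def profit_ratio(legit_sales, pirates, dev_cost, conversion=0.10):
    """Profit if `conversion` of the pirates bought, divided by current profit."""
    base_profit = legit_sales * PRICE - dev_cost
    extra = pirates * conversion * PRICE
    return (base_profit + extra) / base_profit


# Reading 1: about 90 pirates per 100 legit users, game cost $910 to make.
print(round(profit_ratio(100, 90, dev_cost=910), 3))    # 2.0
print(round(profit_ratio(1000, 900, dev_cost=910), 3))  # 1.099

# Reading 2: 90% of all copies are pirated (900 pirates per 100 legit users).
print(round(profit_ratio(100, 900, dev_cost=100), 3))   # 2.0
print(round(profit_ratio(100, 900, dev_cost=0), 3))     # 1.9
```

Under the first reading the "double" only holds at one lucky point; under the second it hovers between 1.9 and 2 as long as the development cost is small next to revenue.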

Second check: the remark ignores the possibility that 10% of people who pirate games have also already bought the game, before or after pirating. If this possibility is true, and if we also assume the remark is true (in whatever way), then if you waved a magic wand to suddenly get rid of piracy, your profits could halve!

This is all fun speculation, of course. If we ignore pirates who pirate after buying (perhaps it's the simplest way to transfer a copy of a game they already bought to another machine, much like pirating shows that are available on your Netflix/Amazon Prime subscription is often easier than actually using their clunky interfaces and being at the mercy of the Streaming Network Gods), then maybe 0% of pirates convert to buys after pirating. Maybe it's 10%. Maybe it's 50%! I don't know. I am strongly disinclined to believe the 85-90% of all copies figure without citation. I'm also not too inclined to believe the less extreme interpretation (even ignoring that the doubling statement is nonsensical if you take it), though it is at least a plausibly conservative estimate: that the number of pirated copies equals 80-90% of the number of sold copies. And I'm too lazy to look up the research myself, which is probably full of methodological errors anyway. (A commenter helpfully posted https://software.intel.com/en-us/blogs/2012/09/22/gaming-piracy-separating-fact-from-fiction, and yeah, the 90% figure is BS.) But is there a way a game programmer could get this data, as accurately as possible, at least for their own game? I thought about it, and came up with a scheme based on simply asking the player rather than any sort of convoluted DRM that players will eventually find a way around anyway.

Sometime during gameplay, after enough time has elapsed that I think the player is enjoying the game and either in or entering its mid-stages, and only during a non-peak time (like the player being at the main menu, or pausing the game and being idle, or just being idle) my game pops up an unintrusive dialog box: "Do you want to take a short (< 2 mins) survey about the game?" I don't care about "no" responses, but you could apply the same sort of scheme for what happens in the "yes" case to record the responses, since the "do you want to take the survey?" is itself a question on the survey, with a "no" response indicating "N/A" for all other questions.

If they pick yes, they are presented with whatever survey questions I want to take (hopefully one at a time). They are told that this survey data is anonymous. They are given a link to instructions for nerds to verify that the survey is indeed anonymous since they can check with net monitoring tools and the game's log files that this scheme does what it says. If the player can feel confident that the survey is anonymous, they're less likely to lie, especially about the big question regarding piracy. I might also offer them some sort of guarantee that the survey will not brick the software if they answer something in a way they think I might not like.

Among the survey questions should be this question tree:


"Are you currently using a pirated copy of this game?"
|--If yes: "Do you intend to buy (perhaps on a sale), ask for, trade, or otherwise acquire a legal copy of this game sometime in the future?"
|--If no: "Thank you. Have you ever used a pirated copy of this game?"
   |--If yes: "Did you take this survey in that version of the game?"
      |--If yes: "Did you answer 'yes' when asked if you were using a pirated version of the game?"
         |--If yes: "Did you answer 'yes' when asked if you intended to acquire a legal copy of this game at a later date?"
         |(Answers: Yes, No, Don't Remember)
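One way the game might hold this tree in memory is as linked question nodes, each mapping an answer to its follow-up. This is a rough sketch, not a prescribed design: the `Question` class, the `ask` helper, and the lowercase answer strings are all invented here, and the nesting (each "if yes" hanging off the question before it, under the "if no" branch) is my reading of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    # Maps an answer such as "yes" or "no" to the follow-up question, if any.
    follow_ups: dict = field(default_factory=dict)


def ask(question, answer_fn):
    """Walk the tree from `question`, returning a list of (text, answer) pairs."""
    record = []
    while question is not None:
        answer = answer_fn(question.text)
        record.append((question.text, answer))
        question = question.follow_ups.get(answer)  # None ends the walk
    return record


Q_PAST_INTENT = Question(
    "Did you answer 'yes' when asked if you intended to acquire a legal copy "
    "of this game at a later date?")
Q_PAST_PIRATED = Question(
    "Did you answer 'yes' when asked if you were using a pirated version of "
    "the game?", {"yes": Q_PAST_INTENT})
Q_PAST_SURVEY = Question(
    "Did you take this survey in that version of the game?",
    {"yes": Q_PAST_PIRATED})
Q_EVER = Question(
    "Thank you. Have you ever used a pirated copy of this game?",
    {"yes": Q_PAST_SURVEY})
Q_INTENT = Question(
    "Do you intend to buy (perhaps on a sale), ask for, trade, or otherwise "
    "acquire a legal copy of this game sometime in the future?")
Q_ROOT = Question(
    "Are you currently using a pirated copy of this game?",
    {"yes": Q_INTENT, "no": Q_EVER})
```

In the game, `answer_fn` would be whatever shows the dialog and returns the player's choice; a "don't remember" answer simply has no follow-up and ends the walk.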


Now here's the fun part: anonymous survey submission.

There are two problems with just sending this data to a collector on my game's website. The first problem is technical: how can I stop spammers from flooding this collector with bogus results, or just plain DDoSing it? The second problem is that it's not really anonymous -- you have exposed an IP address to me! Now we all know IP != identity, and the player may have pirated on one IP and submitted the form on another, but most pirates aren't that careful. So there's no guarantee that I'm not taking this survey data, picking out all the "yes" responses to the "are you using a pirated copy?" question, and sending the IP and timestamp to that user's ISP to extort some monies. Heck, I could even just take all the IPs from the survey responses regardless of answers and correlate them with my own list of observed seed IPs in a torrent swarm of the game.

You may have been thinking "just submit to Google Forms, they don't give the IP to the form creator so only Google would have that data, and Google is trustworthy." But that has the same technical problem as #1 above. What's to stop someone from figuring out the submission URL and writing a bash script to submit bogus data, rate limited only by Google's rate limiter, which is probably pretty generous and in any case won't help against a botnet or Tor?

My proposed scheme solves both of these problems at once. The solution is inspired by Bitcoin's use of proof-of-work, and by the historical use of places like alt.anonymous.messages used as a nymserver, where anyone posts (with IPs either obfuscated by the poster or not publicly known or not guaranteed to match the originator) and everyone reads looking for messages posted at them.

At a high level, there are just two network requests, one to GET data before submitting the survey, and one to POST data containing the encrypted survey data itself plus some extra bits. I must not have server log access to either of the endpoints -- someone will have the IPs, but that someone will not be me. In other words, the survey data is anonymous to me, if I don't collude with people. So it's important I have no affiliation with the operators, or even better if the endpoint is something like the bitcoin relay network and when you receive a new message you can't be sure if the IP you received it from originated it or not. I'm actually a bit at a loss for what a viable alternative to just using alt.anonymous.messages would be, I don't know (and can't seem to find) anything that's functionally the same as pastebin but makes it easy to stream new messages. I was thinking one of the -chans, but board admins can see IPs.

For my proof of concept, the GET request would just query a public twitter account's twitter feed, and get the most recent tweet. This most recent tweet will contain two pieces of data: a number corresponding to the difficulty of the proof of work needed for the followup POST request, and some other number that should be treated as random. To simplify the explanation, the difficulty just corresponds to telling the client they need to find a sha256 hash of "the random number + a self-generated random number" whose numerical value is less than the current target value specified by the difficulty. (Difficulty = max target / current target.) Or if you represent the hash in fixed width hex form, then the difficulty approximately says how many leading 0s there must be.
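The difficulty-to-target relationship can be sketched as below. The helper names and the choice of `2**256 - 1` as the maximum target are assumptions for illustration (the post doesn't fix a maximum), and the two numbers are treated as byte strings that get concatenated before hashing.

```python
import hashlib

# Assumption: the max target is the largest value a sha256 hash can take.
MAX_TARGET = 2**256 - 1


def target_for(difficulty):
    """Difficulty = max target / current target, solved for the current target."""
    return MAX_TARGET // difficulty


def hash_value(server_number: bytes, client_number: bytes) -> int:
    """sha256 of the tweeted number plus the self-generated one, as an integer."""
    digest = hashlib.sha256(server_number + client_number).digest()
    return int.from_bytes(digest, "big")


def is_valid(server_number: bytes, client_number: bytes, difficulty: int) -> bool:
    """True when the hash is numerically below the current target."""
    return hash_value(server_number, client_number) < target_for(difficulty)
```

With these definitions a difficulty of `16**4` gives a target just under `2**240`, which is the "four leading zeros in fixed-width hex" picture: each factor of 16 in difficulty costs roughly one more leading hex zero.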

After receiving the difficulty number + random number, the game spins off a thread that works on finding a valid hash of number+random stuff, at leisure to not impact the game experience. That means that on my very own hardware, if I set the difficulty such that it takes on average one minute to find a valid hash with special software using 100% of my GPU, it will take substantially longer than one minute average to find a hash if the game code is trying to find it, because the game isn't devoting 100% of the CPU or GPU to the task. Even on a loading screen or a pause menu it'd be rude to take up 100% even though you don't have to worry about framerate. So that one minute time would need to be determined with more rationale, it's just an example.
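A throttled background search might look something like the following. It's a sketch under several assumptions: the batch size and rest interval are arbitrary placeholders, the self-generated random number is 16 bytes from `os.urandom`, and `find_nonce` / `start_search` are invented names. A real game would more likely do this in native code, but the shape is the same: hash in small bursts and sleep in between so the game keeps its frame budget.

```python
import hashlib
import os
import threading
import time


def find_nonce(server_number, target, stop, batch=2000, rest=0.005):
    """Try random nonces in small batches, sleeping between batches."""
    while not stop.is_set():
        for _ in range(batch):
            nonce = os.urandom(16)  # the client's self-generated random number
            digest = hashlib.sha256(server_number + nonce).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce, digest
        time.sleep(rest)  # give the CPU back to the game for a moment
    return None  # the search was cancelled before a hash was found


def start_search(server_number, target):
    """Run find_nonce on a daemon thread; the result lands in result['found']."""
    stop = threading.Event()
    result = {}

    def run():
        result["found"] = find_nonce(server_number, target, stop)

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker, stop, result
```

Setting `stop` lets the game abandon the search cleanly, for example when the player quits before a hash turns up.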

When a valid hash is found, the game makes one POST request to some server (let's say AAM for now), containing a PGP encrypted message to a public key I bundle with the game. The content of the message is the found hash, the client-generated random data used to find the hash, and the survey data. The public key and raw data are logged locally so that nerds can verify the outgoing network data matches the PGP encryption and I didn't sneak anything else in there. And that's it, as far as the client is concerned.

On my side, I'm occasionally polling this message board and looking for new messages that are encrypted to my public key and under a maximum size. When I find one, I decrypt it. Then I check that sha256(random number + included random bits) indeed equals the hash you gave me and is less than the target. If it's all good, I record the survey data for later analysis, and then I post a new tweet with the same difficulty (unless I feel the need to adjust it) and a new random value. (Which may just be something like sha256("my super secret key" + unix time), or maybe even some quantum bits.) In case multiple game users are solving against the same tweet, I check the hash against the last n tweets up to a time limit, and only update the twitter feed if a hash comes in that matches the most recent tweet criteria. (And if one is coming in at close to the max speed, it's time to raise the difficulty and lower the last-n-tweets acceptance range, possibly as much as just the last tweet. Depending on the server, it might even be worthwhile to have it immediately drop connections whose first real data packet doesn't contain bits+hash to check as a DDoS mitigation...)
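The check against the last n tweets could be sketched like this, leaving out the polling and PGP decryption. Everything named here is hypothetical: the `Challenge` record, the one-hour `max_age`, and the idea of keeping challenges oldest-first in a `deque` with `maxlen=n`.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass
class Challenge:
    number: bytes     # the random number from one tweet
    target: int       # the target implied by that tweet's difficulty
    posted_at: float  # unix time the tweet was posted


def check_submission(challenges, nonce, claimed_hash, now, max_age=3600.0):
    """Return (accepted, matched_latest) for one decrypted submission.

    `challenges` holds the last n tweets, oldest first.
    """
    latest = challenges[-1] if challenges else None
    for challenge in reversed(challenges):
        if now - challenge.posted_at > max_age:
            break  # this one and everything older is past the time limit
        digest = hashlib.sha256(challenge.number + nonce).digest()
        if digest == claimed_hash and int.from_bytes(digest, "big") < challenge.target:
            return True, challenge is latest
    return False, False


recent = deque(maxlen=5)  # n = 5 is an arbitrary example
```

When `matched_latest` comes back true, that is the cue to post a fresh tweet; an accepted match against an older tweet just gets its survey data recorded.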

Let's address a few potential weaknesses or issues with the scheme.

Can the client know that its POST was picked up? This can be done by having the twitter account feed include the sha256 hash of the last survey data, and always making a new tweet when valid data comes in (but not necessarily changing the difficulty or the random number). This results in the survey results being public, but still anonymous. The client can simply poll the twitter feed after it sends its POST and see if any recent tweets have the same survey data. While it's possible the survey data of another client matches exactly, and theirs went through while yours didn't (false positive), this seems somewhat unlikely and in any case could also be mitigated by adding some "randomly generated" data tacked on at the end by the client. (Perhaps the publicly shared hash is actually the sha(generated random value for other hash + survey data)? I like that approach.) If the client doesn't see its data hash in a certain amount of time, it can schedule a recompute and send.
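The receipt idea from the parenthetical (hash the client's random value together with the survey data) might be sketched as follows. The JSON serialization with sorted keys and the plain substring search over tweet text are assumptions made for the example; the post doesn't specify a wire format.

```python
import hashlib
import json


def receipt(client_random: bytes, survey: dict) -> str:
    """sha256 over the client's random value plus the serialized survey data."""
    # Sorted keys so the client and the collector serialize identically.
    payload = json.dumps(survey, sort_keys=True).encode("utf-8")
    return hashlib.sha256(client_random + payload).hexdigest()


def was_picked_up(my_receipt: str, recent_tweets) -> bool:
    """True if any recently fetched tweet text contains this client's receipt."""
    return any(my_receipt in tweet for tweet in recent_tweets)
```

Because the random value is mixed in, two players with identical answers get different receipts, so one player's accepted survey can't be mistaken for another's.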

Twitter has your GET IP, but unless I become an employee at Twitter (albeit I do know an employee at Twitter...) I'm unlikely to even have a chance at getting this. To make even that possible IP leak less likely, the GET request might be multiple GET requests at various times, with only one of them actually being acted upon at random and at a random point in the future. Additionally the GET request might happen randomly in the game even without the survey having been presented yet. This would prevent anyone from correlating the IP-timestamp of a GET to a timestamp of a POST. (Remember we're concerned with protecting the anonymity of the POST, and hence the POST's IP.)
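The decoy-GET idea could be planned up front, roughly like this. It's a loose sketch: `plan_fetches`, the count of five, and the one-hour window are all placeholders, and it covers only the "several GETs, one acted upon at random" part, not the delay before the work actually starts.

```python
import random


def plan_fetches(now, count=5, window=3600.0, rng=random):
    """Pick `count` random fetch times in the next `window` seconds.

    Returns (times, used): every time gets a real GET, but only the
    response from times[used] is fed into the proof-of-work search.
    """
    times = sorted(now + rng.uniform(0, window) for _ in range(count))
    used = rng.randrange(count)
    return times, used
```

From the outside all of the fetches look identical, so whoever holds the GET logs can't tell which one a later POST depended on.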

The proof of work hash algorithm should obviously be scrypt, but sha256 is easier to explain for me.

The difficulty should only be increased if I'm getting results for the most recent difficulty target at the minimum average speed for a good GPU, indicating someone made a custom program to try and feed my survey wrong data. Another possible indicator of abuse is by comparing incoming POSTs to the number of active users as told by SteamCharts.

The game should resume its calculating if the game was closed and reopened but the survey hadn't been sent yet because a solution hadn't been found yet.

In the event of network (or other) errors, even ones that persist despite exponential backoff retry attempts (for instance, maybe the twitter account got banned and it's 3am so I haven't noticed...), the game can always ask the user if they want to send the data directly to me non-anonymously, or if they want to try again maybe when the next game update is shipped, or if they just want to forget about it.
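The retry part might be sketched as below. `send_with_backoff` and its defaults are invented for illustration, and the random jitter is an addition of mine that the post doesn't mention. `send` stands in for whatever performs the POST and is assumed to raise `OSError` (which includes `ConnectionError`) on network failure.

```python
import random
import time


def send_with_backoff(send, attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure.

    Returns True on success, or False once every attempt has failed.
    """
    for attempt in range(attempts):
        try:
            send()
            return True
        except OSError:
            if attempt == attempts - 1:
                break  # out of attempts, no point waiting again
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter (my assumption)
    return False
```

A `False` return is the point where the game would ask the player whether to send the data non-anonymously, try again later, or forget about it.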


Posted on 2016-02-03 by Jach

Tags: intellectual property, philosophy, programming

Permalink: https://www.thejach.com/view/id/318

Trackback URL: https://www.thejach.com/view/2016/2/automated_anonymous_surveying
