Suppose that Pac-Man is in the smallGrid world with just two pieces of food. On every time step his living reward is -1. When he eats a piece of food he gets an additional reward of 10. If he manages to eat all the food he gets a bonus reward of 500. Given this setup, here is one example reward history he could have:
self.rewards = [-1, -1, 9, -1, -1, 509]
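To see where those numbers come from: the 9 is the living reward plus the food reward (-1 + 10), and the 509 is the living reward plus the food reward plus the all-food bonus (-1 + 10 + 500). Here is a minimal sketch of that decomposition in Python; the constant names are made up for illustration and are not part of the project code:

LIVING_REWARD = -1   # received on every time step
FOOD_REWARD = 10     # additional reward for eating a piece of food
FINISH_BONUS = 500   # bonus for eating the last piece of food

rewards = [
    LIVING_REWARD,                               # step 0: just moved
    LIVING_REWARD,                               # step 1: just moved
    LIVING_REWARD + FOOD_REWARD,                 # step 2: ate a piece of food -> 9
    LIVING_REWARD,                               # step 3: just moved
    LIVING_REWARD,                               # step 4: just moved
    LIVING_REWARD + FOOD_REWARD + FINISH_BONUS,  # step 5: ate the last food -> 509
]
assert rewards == [-1, -1, 9, -1, -1, 509]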
Then his discounted reward for each step would be calculated as follows (assuming the discount is d):

discountedReward5 = 509
discountedReward4 = -1 + d*509
discountedReward3 = -1 + d*(-1) + d^2*509
discountedReward2 = 9 + d*(-1) + d^2*(-1) + d^3*509
discountedReward1 = -1 + d*9 + d^2*(-1) + d^3*(-1) + d^4*509
discountedReward0 = -1 + d*(-1) + d^2*9 + d^3*(-1) + d^4*(-1) + d^5*509
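In other words, each reward from step t onward is weighted by d raised to how many steps in the future it arrives. As a minimal sketch of that direct computation (the helper name discounted_reward_direct is hypothetical, not part of the project code):

def discounted_reward_direct(rewards, t, d):
    # Weight each reward from step t onward by d raised to its distance from t.
    return sum(d ** (k - t) * rewards[k] for k in range(t, len(rewards)))

rewards = [-1, -1, 9, -1, -1, 509]
print(discounted_reward_direct(rewards, 3, 0.9))  # -1 + 0.9*(-1) + 0.81*509 = 410.39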
Each of these discounted rewards can be written in terms of the previous one like this:

discountedReward5 = 509
discountedReward4 = -1 + d * discountedReward5
discountedReward3 = -1 + d * discountedReward4
discountedReward2 = 9 + d * discountedReward3
discountedReward1 = -1 + d * discountedReward2
discountedReward0 = -1 + d * discountedReward1

In general the calculation is:
discountedReward_t = R(t) + d * discountedReward_(t+1)

Thus if you iterate over the self.rewards list in reverse order, you can easily calculate the discounted reward for each time step (see the sketch after the values below). For this particular example, with a discount of d = 0.9 and rounding to the nearest whole number, the discounted rewards would be:
discountedReward5 = 509
discountedReward4 = 457
discountedReward3 = 410
discountedReward2 = 378
discountedReward1 = 340
discountedReward0 = 305
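Putting this together, here is a minimal sketch of the reverse-order calculation in Python. The function name compute_discounted_rewards is made up for illustration and is not part of the project code; with d = 0.9 it reproduces the unrounded versions of the values above:

def compute_discounted_rewards(rewards, d):
    # Walk the reward list backwards, applying
    # discountedReward_t = R(t) + d * discountedReward_(t+1) at each step.
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + d * running
        discounted[t] = running
    return discounted

rewards = [-1, -1, 9, -1, -1, 509]
print(compute_discounted_rewards(rewards, 0.9))
# approximately [304.56, 339.52, 378.35, 410.39, 457.10, 509.00]

Iterating in reverse lets every discounted reward be computed in a single pass over the list, rather than re-summing the remaining rewards separately for each time step.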