1 min read · Jan 31, 2020
Hi Robert,
Nice question! The smaller reward for p1 tells p1 to be more aggressive, so that it seeks a win rather than settling for a draw. In the end we use p1 to play against a human player, which is why it matters, but setting it to 0.5 should also work.
Note that the reward can shape an agent's behaviour drastically in some cases; for an example, see https://towardsdatascience.com/reinforcement-learning-cliff-walking-implementation-e40ce98418d4.
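To make the idea concrete, here is a minimal sketch of how the end-of-game rewards could be assigned. The function name `final_rewards`, its parameters, and the default of 0.1 for p1's draw reward are assumptions for illustration only, not the article's actual code; only the 0.5 alternative comes from the discussion above.

```python
# Hypothetical sketch: names and the 0.1 default are illustrative assumptions.
def final_rewards(winner, p1_draw_reward=0.1, p2_draw_reward=0.5):
    """Return (p1_reward, p2_reward) at the end of a game.

    winner: 1 if p1 won, -1 if p2 won, 0 for a draw.
    """
    if winner == 1:
        return 1.0, 0.0
    if winner == -1:
        return 0.0, 1.0
    # Draw: a smaller p1 reward makes a draw less attractive to p1,
    # nudging it to play for a win instead.
    return p1_draw_reward, p2_draw_reward


print(final_rewards(0))                      # asymmetric draw rewards
print(final_rewards(0, p1_draw_reward=0.5))  # symmetric draw rewards
```

With the symmetric setting both players value a draw equally, so p1 has less incentive to take risks for a win.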