We defined our state space as a 10 by 10 grid, which gives us a hundred possible states. The actions are up, down, left, right and stand, with which our agent can move through the grid. All but four cells of the grid are defined as dangerous; the remaining four are safe. The optimal behaviour we expect from our agent is to reach safety as quickly as possible and stay there.
In a real-world setting you can compare the task to heading for the emergency exits in a building.
Here is what our grid looked like:
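In code, a minimal sketch of such an environment could look like the one below. The positions of the four safe cells, the reward values and the helper names are assumptions for illustration, not the exact setup from our notebook.

import numpy as np

GRID_SIZE = 10
ACTIONS = ['up', 'down', 'left', 'right', 'stand']

# For illustration we mark the four corners as safe; every other cell is dangerous.
SAFE_CELLS = {(0, 0), (0, 9), (9, 0), (9, 9)}

def step(state, action):
    # Apply an action to a state (a numpy array [row, col]) and return (next_state, reward).
    row, col = int(state[0]), int(state[1])
    if action == 'up':
        row = max(row - 1, 0)
    elif action == 'down':
        row = min(row + 1, GRID_SIZE - 1)
    elif action == 'left':
        col = max(col - 1, 0)
    elif action == 'right':
        col = min(col + 1, GRID_SIZE - 1)
    # 'stand' leaves the position unchanged.
    reward = 1.0 if (row, col) in SAFE_CELLS else -1.0
    return np.array([row, col]), reward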
The Q function maps state-action pairs to values which, if correct, can be used to find the best action for the state you are in. As a backend for our Q function we used a MongoDB where we store state, action, value documents like this (states are numpy arrays):
document = {'state': state.tostring(), 'action': action, 'value' : value}
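With pymongo, writing and reading such documents could look roughly like this; the client setup and the database and collection names are made up for illustration.

from pymongo import MongoClient

client = MongoClient()                    # assumes a MongoDB instance on localhost
q_table = client['rl_grid']['q_values']   # hypothetical database and collection names

def write_q(state, action, value):
    # One document per (state, action) pair, keyed by the raw state bytes.
    q_table.update_one(
        {'state': state.tostring(), 'action': action},   # tostring() is tobytes() in newer numpy
        {'$set': {'value': value}},
        upsert=True,
    )

def read_q(state, action, default=0.0):
    doc = q_table.find_one({'state': state.tostring(), 'action': action})
    return doc['value'] if doc is not None else default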
We defined a run as the agent performing ten actions on the grid, and for training we had it perform a thousand runs, each starting from a random state. We trained our Q function using the value iteration update as described here. As a loss we defined the sum over all rewards from all possible runs, meaning one run for each initial state on the grid. Here is how our loss improved with the training:
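A rough sketch of such a training loop, reusing the hypothetical helpers from the sketches above (step, read_q, write_q), could look like this; the discount factor and the exploration strategy are assumptions.

import random

GAMMA = 0.9        # assumed discount factor
N_RUNS = 1000      # a thousand training runs
RUN_LENGTH = 10    # ten actions per run

def best_action_and_value(state):
    # Greedy lookup over the stored Q-values for one state.
    value, action = max((read_q(state, a), a) for a in ACTIONS)
    return action, value

def loss():
    # Sum over all rewards collected by greedy runs from every initial state on the grid.
    total = 0.0
    for row in range(GRID_SIZE):
        for col in range(GRID_SIZE):
            state = np.array([row, col])
            for _ in range(RUN_LENGTH):
                action, _ = best_action_and_value(state)
                state, reward = step(state, action)
                total += reward
    return total

for run in range(N_RUNS):
    # Each run starts from a random cell and performs ten actions.
    state = np.array([random.randrange(GRID_SIZE), random.randrange(GRID_SIZE)])
    for _ in range(RUN_LENGTH):
        action = random.choice(ACTIONS)   # random exploration; the notebook may act differently
        next_state, reward = step(state, action)
        # Value-iteration-style update: Q(s, a) <- reward + gamma * max over a' of Q(s', a')
        _, next_value = best_action_and_value(next_state)
        write_q(state, action, reward + GAMMA * next_value)
        state = next_state
    if run % 100 == 0:
        print(run, loss())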
Once the loss reached its minimum, our agent had learned the optimal behaviour in case of danger on our grid. We computed everything in a Jupyter notebook, which you can find on our GitHub.
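To read the learned behaviour back out of the Q function, you simply follow the highest-valued action in every state. With the hypothetical helpers from the sketches above, such a greedy run could look like this:

def greedy_run(start_state, n_steps=RUN_LENGTH):
    # Follow the highest-valued action from each state and collect the visited cells.
    state = start_state
    trajectory = [tuple(state)]
    for _ in range(n_steps):
        action, _ = best_action_and_value(state)
        state, _ = step(state, action)
        trajectory.append(tuple(state))
    return trajectory

# Starting from the middle of the grid, the agent should head for a safe cell
# and then stand there for the remaining steps.
print(greedy_run(np.array([5, 5])))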

