We defined our state space as a 10 by 10 grid, which gives us a hundred possible states. The actions are up, down, left, right and stand, with which our agent can move through the grid. All but four cells of the grid are defined as dangerous; the remaining four are safe. The optimal behaviour we expect from our agent is to reach safety as quickly as possible and stay there.
In a real-world setting you can compare the task to heading for the emergency exits in a building.
Here is what our grid looked like:
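In code, a minimal sketch of such an environment could look like the one below. The positions of the four safe cells, the reward values and the helper names are assumptions for illustration, not the exact setup from our notebook.

import numpy as np

GRID_SIZE = 10
ACTIONS = ['up', 'down', 'left', 'right', 'stand']

# For illustration we mark the four corners as safe; every other cell is dangerous.
SAFE_CELLS = {(0, 0), (0, 9), (9, 0), (9, 9)}

def step(state, action):
    # Apply an action to a state (a numpy array [row, col]) and return (next_state, reward).
    row, col = int(state[0]), int(state[1])
    if action == 'up':
        row = max(row - 1, 0)
    elif action == 'down':
        row = min(row + 1, GRID_SIZE - 1)
    elif action == 'left':
        col = max(col - 1, 0)
    elif action == 'right':
        col = min(col + 1, GRID_SIZE - 1)
    # 'stand' leaves the position unchanged.
    reward = 1.0 if (row, col) in SAFE_CELLS else -1.0
    return np.array([row, col]), reward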
The Q function maps state-action pairs to values which, if correct, can be used to find the best action for the state you are in. As a backend for our Q function we used a MongoDB where we store state, action, value documents like this (states are numpy arrays):
document = {'state': state.tostring(), 'action': action, 'value' : value}
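With pymongo, writing and reading such documents could look roughly like this; the client setup and the database and collection names are made up for illustration.

from pymongo import MongoClient

client = MongoClient()                    # assumes a MongoDB instance on localhost
q_table = client['rl_grid']['q_values']   # hypothetical database and collection names

def write_q(state, action, value):
    # One document per (state, action) pair, keyed by the raw state bytes.
    q_table.update_one(
        {'state': state.tostring(), 'action': action},   # tostring() is tobytes() in newer numpy
        {'$set': {'value': value}},
        upsert=True,
    )

def read_q(state, action, default=0.0):
    doc = q_table.find_one({'state': state.tostring(), 'action': action})
    return doc['value'] if doc is not None else default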
We defined a run as the agent performing ten actions on the grid, and for training we had it perform a thousand runs, each starting from a random state. We trained our Q function using the value iteration update as described here. As a loss we defined the sum over all rewards from all possible runs, meaning one run for each initial state on the grid. Here is how our loss improved with the training:
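A rough sketch of such a training loop, reusing the hypothetical helpers from the sketches above (step, read_q, write_q), could look like this; the discount factor and the exploration strategy are assumptions.

import random

GAMMA = 0.9        # assumed discount factor
N_RUNS = 1000      # a thousand training runs
RUN_LENGTH = 10    # ten actions per run

def best_action_and_value(state):
    # Greedy lookup over the stored Q-values for one state.
    value, action = max((read_q(state, a), a) for a in ACTIONS)
    return action, value

def loss():
    # Sum over all rewards collected by greedy runs from every initial state on the grid.
    total = 0.0
    for row in range(GRID_SIZE):
        for col in range(GRID_SIZE):
            state = np.array([row, col])
            for _ in range(RUN_LENGTH):
                action, _ = best_action_and_value(state)
                state, reward = step(state, action)
                total += reward
    return total

for run in range(N_RUNS):
    # Each run starts from a random cell and performs ten actions.
    state = np.array([random.randrange(GRID_SIZE), random.randrange(GRID_SIZE)])
    for _ in range(RUN_LENGTH):
        action = random.choice(ACTIONS)   # random exploration; the notebook may act differently
        next_state, reward = step(state, action)
        # Value-iteration-style update: Q(s, a) <- reward + gamma * max over a' of Q(s', a')
        _, next_value = best_action_and_value(next_state)
        write_q(state, action, reward + GAMMA * next_value)
        state = next_state
    if run % 100 == 0:
        print(run, loss())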
Once the loss reached its minimum, our agent had learned the optimal behaviour in case of danger on our grid. We computed everything in a Jupyter notebook, which you can find on our GitHub.
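To read the learned behaviour back out of the Q function, you simply follow the highest-valued action in every state. With the hypothetical helpers from the sketches above, such a greedy run could look like this:

def greedy_run(start_state, n_steps=RUN_LENGTH):
    # Follow the highest-valued action from each state and collect the visited cells.
    state = start_state
    trajectory = [tuple(state)]
    for _ in range(n_steps):
        action, _ = best_action_and_value(state)
        state, _ = step(state, action)
        trajectory.append(tuple(state))
    return trajectory

# Starting from the middle of the grid, the agent should head for a safe cell
# and then stand there for the remaining steps.
print(greedy_run(np.array([5, 5])))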

