Snake Game AI

I started a personal project to learn Reinforcement Learning. The aim was to develop an agent that could achieve a perfect score in a Snake game. Snake is essentially a path-finding problem and could be solved with classical algorithms like A*, but the goal here was to tackle it with RL.
After learning the basics of RL, I started experimenting with OpenAI Gym environments. I learned about Gym spaces and how they define the action space and observation space of an environment, and I experimented with RL algorithms such as A2C, PPO, DDPG, and DQN on classic environments like the following:

CartPole

Acrobot

Pendulum

Mountain Car

Bipedal Walker

Lunar Lander
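
As a minimal sketch of that workflow (assuming Stable-Baselines3 as the algorithm library, which is not named above), training PPO on CartPole looks roughly like this:

import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
print(env.action_space)       # Discrete(2): push the cart left or right
print(env.observation_space)  # Box(4,): cart position/velocity, pole angle/velocity

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Roll out one episode with the trained policy (classic Gym step API)
obs, done, total = env.reset(), False, 0.0
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    total += reward
print(total)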

Then I developed a custom Snake game Gym environment with a discrete action space and a box observation space. After experimenting with different observations, the following one gave good results; it consists of the snake head coordinates, the difference between the snake and apple coordinates, the snake length, and a list of the snake's previous actions.
observation = [snake_head_x, snake_head_y, apple_delta_x, apple_delta_y, snake_length] + list(self.prev_actions)
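
A hedged skeleton of how such an environment might be wired up with Gym spaces is shown below; the grid size, action-history length, and bounds are illustrative assumptions rather than the project's actual values:

import numpy as np
import gym
from gym import spaces
from collections import deque

PREV_ACTIONS_LEN = 30   # assumed length of the previous-actions window
GRID = 500              # assumed size of the game grid

class SnakeEnv(gym.Env):
    """Skeleton of the custom Snake environment (game logic elided)."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(4)   # up / down / left / right
        self.observation_space = spaces.Box(
            low=-GRID, high=GRID,
            shape=(5 + PREV_ACTIONS_LEN,), dtype=np.float32,
        )

    def reset(self):
        self.snake_head = [GRID // 2, GRID // 2]
        self.apple = [np.random.randint(GRID), np.random.randint(GRID)]
        self.snake_length = 1
        self.prev_actions = deque([-1] * PREV_ACTIONS_LEN, maxlen=PREV_ACTIONS_LEN)
        return self._get_obs()

    def step(self, action):
        self.prev_actions.append(action)
        # ... move the snake, detect apple / wall / self collisions,
        #     update self.snake_length, compute reward and done ...
        reward, done = 0.0, False
        return self._get_obs(), reward, done, {}

    def _get_obs(self):
        apple_delta_x = self.apple[0] - self.snake_head[0]
        apple_delta_y = self.apple[1] - self.snake_head[1]
        return np.array(
            [self.snake_head[0], self.snake_head[1],
             apple_delta_x, apple_delta_y, self.snake_length]
            + list(self.prev_actions),
            dtype=np.float32,
        )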

I worked on different reward functions to improve the performance of the agent. I started with a simple reward function where the agent received a reward for eating an apple and a penalty for dying. As the agent was not able to train due to the sparse nature of that reward, I experimented with the following reward functions, which give a small reward for each step taken by the agent and a big apple reward for eating apples (a sketch of one variant follows the list):
1. self.total_reward = ((100 - euclidean_dist_to_apple) + apple_reward) / 100
   Rewards the agent for getting closer to the apple at each step (at most 100 steps are needed to cover the whole game grid).
2. self.total_reward = ((100 - self.steps_taken) / 200) + apple_reward
   Rewards the agent for surviving; the step reward shrinks over time to motivate the agent to eat the apple.
3. self.total_reward = (((100 - self.steps_taken) / 200) + apple_reward) * self.snake_length
   The reward grows with the snake length, since the game becomes harder as the snake grows.
4. self.total_reward = ((((100 * self.snake_length) - self.steps_taken) / (200 * self.snake_length)) + apple_reward) * self.snake_length
   Both the reward and the number of steps over which it is spread grow with the snake length.
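
As an illustration, variant 4 written out as a small helper; the apple reward value and the example numbers are assumptions, not the project's exact settings:

def shaped_reward_v4(snake_length, steps_taken, ate_apple, apple_reward=10000):
    """Variant 4: reward magnitude and the step span it is spread over
    both grow with the snake length."""
    bonus = apple_reward if ate_apple else 0
    return (((100 * snake_length) - steps_taken) / (200 * snake_length) + bonus) * snake_length

# A short snake still wandering earns a small step reward,
# while a longer snake that just ate an apple earns a much larger one.
print(shaped_reward_v4(snake_length=1, steps_taken=20, ate_apple=False))  # small step reward
print(shaped_reward_v4(snake_length=5, steps_taken=20, ate_apple=True))   # large apple reward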

With the above reward engineering, the agent's performance improved up to a certain score when the observation included a few of the agent's previous actions. When the observation size was increased in order to train a perfect-scoring agent, performance did not improve further. After this, the whole game image was used as the observation. With this new observation, the following variations were implemented, each with 2-, 4-, and 8-frame stacking variants (a frame-stacking sketch follows the list):

“Simple” game image observation

Observation with “Turning features” to better represent the path followed

“Perspective” observation, where the snake head always remains at the center, giving the agent a stable viewpoint
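
A possible way to produce the 2-, 4-, and 8-frame stacked variants, assuming Stable-Baselines3's vectorized frame-stacking wrapper (the exact tooling used is not named above):

from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def make_stacked_env(n_stack):
    # SnakeEnv here is assumed to return one of the image observations above
    venv = DummyVecEnv([lambda: SnakeEnv()])
    return VecFrameStack(venv, n_stack=n_stack)  # stack the last n_stack frames

envs = {n: make_stacked_env(n) for n in (2, 4, 8)}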

As RL algorithms are sample-inefficient and the agent was iterating slowly through episodes with these image-rendered observations, I configured the environment as a vectorized environment that could run in parallel.

8 Environments running in Parallel
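
A sketch of this setup, assuming SubprocVecEnv from Stable-Baselines3 for the 8 parallel workers:

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecFrameStack

if __name__ == "__main__":
    # 8 copies of the image-based Snake environment, each in its own process
    venv = SubprocVecEnv([lambda: SnakeEnv() for _ in range(8)])
    venv = VecFrameStack(venv, n_stack=4)
    model = PPO("CnnPolicy", venv, verbose=1)
    model.learn(total_timesteps=10_000_000)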

As the observations were images, I implemented ResNet-18 as a feature extractor with the “Perspective” type observation. The agent trained with this configuration using the PPO algorithm was able to achieve a perfect score! (after a few days of training…)
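
A hedged sketch of how ResNet-18 can be plugged in as a custom feature extractor for PPO, assuming Stable-Baselines3 and torchvision; the channel adaptation and feature dimension are illustrative assumptions:

import torch.nn as nn
from torchvision.models import resnet18                      # torchvision >= 0.13 API
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class ResNet18Extractor(BaseFeaturesExtractor):
    def __init__(self, observation_space, features_dim=512):
        super().__init__(observation_space, features_dim)
        # Assumes channel-first stacked frames (SB3 transposes image observations)
        n_channels = observation_space.shape[0]
        backbone = resnet18(weights=None)
        # Adapt the first conv to the stacked-frame channel count and drop the
        # ImageNet classification head, keeping the 512-d pooled features
        backbone.conv1 = nn.Conv2d(n_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone

    def forward(self, observations):
        return self.backbone(observations)

# venv: the frame-stacked, vectorized Snake environment from the sketches above
policy_kwargs = dict(
    features_extractor_class=ResNet18Extractor,
    features_extractor_kwargs=dict(features_dim=512),
)
model = PPO("CnnPolicy", venv, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=10_000_000)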

Perfect Scoring Snake Game AI!