**2.3 Reinforcement learning**

The enormous amount of data that models need to train on is a recurring problem in machine learning. The more complicated a model is, the more data it may require, and even then the data gathered might not be trustworthy: it can be inaccurate, incomplete, or compiled from unreliable sources. Reinforcement learning eases this data-acquisition problem, because the agent generates its own experience by interacting with the environment rather than depending on a large pre-collected dataset.

Reinforcement learning (RL) is a subfield of machine learning in which a model learns on its own to solve a problem as well as possible. The learning model must analyse the problem and discover the best solution by itself through trial and error. This means it can also arrive at fast and original solutions that the programmer might never have considered. RL is applied to a particular class of problems, such as those in robotics, games, and other long-horizon sequential tasks.

**Key principles of reinforcement learning**


• The agent must explore the stochastic environment in order to maximise its cumulative reward.
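The tension in this principle, between exploring the environment and exploiting what has already been learned, is commonly handled with an epsilon-greedy rule. The sketch below is a minimal illustration; the function name and the epsilon values are our own assumptions, not taken from the text:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action: explore with probability epsilon, else exploit.

    q_values: estimated values of each action in the current state.
    epsilon:  exploration probability in [0, 1] (illustrative choice).
    """
    if random.random() < epsilon:
        # Explore: try a random action to gather new information.
        return random.randrange(len(q_values))
    # Exploit: choose the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy([0.1, 0.9, 0.4], epsilon=0.0))  # prints 1
```

Setting epsilon closer to 1 yields more exploration early in training; many implementations decay epsilon over time so the agent gradually shifts from exploring to exploiting.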

**Advantages of reinforcement learning**


**Drawbacks of reinforcement learning**


Robots, for example, can use reinforcement learning algorithms to teach themselves to walk. The most widely used algorithm in reinforcement learning is **Q-learning**.

## *2.3.1 Q-learning*

Q-learning [21] is a model-free, off-policy reinforcement learning technique that determines the best action to take given the agent's current state. The agent decides what to do next based on its position in the environment, and the model's goal is to learn the optimal course of action for every situation it encounters. To accomplish this, it may devise its own rules or deviate from the prescribed course of action: the policy it learns about (the greedy, reward-maximising one) need not be the policy it follows while exploring, which is why the method is called off-policy. Model-free means the agent does not build or rely on a model that predicts the environment's response to its actions; instead, it learns from trial and error, using the rewards it actually receives.
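The trial-and-error learning described above is driven by the Q-learning update rule, Q(s, a) ← Q(s, a) + α [r + γ max<sub>a'</sub> Q(s', a') − Q(s, a)], where α is the learning rate and γ the discount factor. As a minimal sketch, the tabular implementation below trains an agent on a hypothetical one-dimensional corridor environment; the environment, function name, and hyperparameter values are illustrative assumptions, not part of the text:

```python
import random

def train_q_learning(n_states=5, episodes=500,
                     alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy 1-D corridor (hypothetical example).

    States are 0..n_states-1; the agent starts at state 0, action 0
    moves left and action 1 moves right, and reaching the rightmost
    state ends the episode with reward 1. alpha, gamma, and epsilon
    are illustrative hyperparameters.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # Q-table: Q[state][action]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection (explore vs. exploit).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Off-policy update: bootstraps from the max over next actions,
            # regardless of which action the behaviour policy will pick.
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q

q = train_q_learning()
# After training, "right" should dominate in every non-terminal state.
print(all(q[s][1] > q[s][0] for s in range(4)))  # prints True
```

Note that the update uses the maximum Q-value of the next state rather than the value of the action actually taken next, which is precisely what makes Q-learning off-policy.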

A recommendation system for advertisements illustrates Q-learning. In a typical ad-suggestion system, the advertisements you see are determined by your past purchases or the websites you have visited: if you have already purchased a TV, TVs of various brands will keep being suggested. With Q-learning, the recommender can instead learn which suggestions maximise long-term reward rather than simply echoing past behaviour. The algorithm can also be readily implemented in a distributed architecture such as a wireless sensor network, where each node performs actions that are expected to optimise its long-term benefit.
