Reinforcement learning involves an agent learning how to behave through trial-and-error interactions with an environment. Rather than being explicitly told which actions to take, the agent receives rewards or punishments that shape its future behavior. There are two main approaches: model-based methods learn a model of the environment and use it to derive an optimal policy, while model-free methods derive a policy directly, without learning an environment model. The exploration-exploitation tradeoff requires balancing the exploitation of actions known to yield reward against exploration of untried actions that may improve long-term learning. Methods such as Q-learning and genetic algorithms are used to generalize to large state and action spaces.
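To make the model-free approach concrete, below is a minimal sketch of tabular Q-learning with epsilon-greedy exploration. The chain-shaped environment, hyperparameters, and function names here are illustrative assumptions, not taken from the source; real applications would substitute their own environment and tuning.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain MDP (hypothetical environment):
    states 0..n_states-1, action 1 moves right, action 0 moves left.
    Reaching the last state yields reward 1 and ends the episode."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def step(s, a):
        # Deterministic transition; reward only on reaching the goal state.
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s2 == n_states - 1
        return s2, (1.0 if done else 0.0), done

    def greedy(s):
        # Break ties randomly so early episodes still make progress.
        best = max(Q[s])
        return rng.choice([a for a in range(n_actions) if Q[s][a] == best])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            a = rng.randrange(n_actions) if rng.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            # Model-free update: bootstrap from the best next-state value,
            # without ever learning the transition or reward model.
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

After training, the greedy policy derived from the learned Q-table moves right from every non-terminal state, even though the agent was never given the environment's dynamics.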