[RL] TRPO

Trust region policy optimization algorithm.

Step 1: Derive the policy improvement

The improvement in expected return when moving from an old policy to a new policy equals the expected discounted sum of the old policy's advantages, evaluated along trajectories generated by the new policy: eta(pi_new) = eta(pi_old) + E_{tau ~ pi_new} [ sum_t gamma^t * A_{pi_old}(s_t, a_t) ].
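As a sanity check, here is a minimal sketch of this identity for a single rollout, assuming we already have per-step estimates of the old policy's advantages along a trajectory collected with the new policy (the array below is made up):

```python
import numpy as np

def policy_improvement_estimate(advantages_old, gamma=0.99):
    """Monte Carlo estimate of eta(pi_new) - eta(pi_old) from one trajectory.

    advantages_old[t] is an estimate of A_{pi_old}(s_t, a_t) on a trajectory
    that was generated by acting with pi_new (hypothetical inputs).
    """
    discounts = gamma ** np.arange(len(advantages_old))
    return float(np.sum(discounts * advantages_old))

# Example: old-policy advantages along one rollout of the new policy.
adv = np.array([0.5, -0.1, 0.3, 0.2])
print(policy_improvement_estimate(adv, gamma=0.99))
```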

Step 2: Optimize a surrogate objective

The true objective of policy optimization is to maximize the expected discounted return of the policy. This objective is hard to optimize directly, because the states in the expectation are visited under the new policy, which we cannot sample from before committing to it. The paper therefore derives a surrogate objective that keeps the old policy's state visitation distribution and only re-weights the actions, and optimizes that surrogate instead.
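A minimal sketch of the sample-based surrogate, assuming we have log-probabilities of the sampled actions under both policies and advantage estimates for the old policy, all collected by running the old policy (the function name and inputs are illustrative):

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages_old):
    """Sample estimate of the surrogate: importance-weighted old-policy advantages.

    All arrays are per (state, action) pair collected by running pi_old:
    log pi_new(a|s), log pi_old(a|s), and A_{pi_old}(s, a) estimates
    (the numbers below are toy values, not from a real environment).
    """
    ratios = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    return float(np.mean(ratios * np.asarray(advantages_old)))

# Toy numbers: the new policy puts more weight on the positive-advantage action.
print(surrogate_objective(logp_new=[-0.9, -1.2, -2.0],
                          logp_old=[-1.1, -1.1, -1.6],
                          advantages_old=[0.4, -0.1, -0.3]))
```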

The surrogate has two useful properties: it matches the true objective to first order at the old policy, and choosing the new policy by maximizing it gives a guaranteed lower bound on the true objective, namely the surrogate value minus a penalty proportional to the maximum KL divergence between the old and new policies.
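The bound itself is a one-liner; the sketch below follows the paper's main theorem, with the surrogate value, maximum KL, and maximum absolute advantage supplied as assumed estimates:

```python
def improvement_lower_bound(surrogate_advantage, max_kl, gamma=0.99, max_abs_adv=1.0):
    """Lower bound on eta(pi_new) - eta(pi_old):

        surrogate_advantage - C * max_s KL(pi_old || pi_new),
        with C = 4 * eps * gamma / (1 - gamma)**2 and eps = max |A_{pi_old}|.

    All inputs are assumed estimates; surrogate_advantage is the
    importance-weighted advantage term (e.g. from surrogate_objective above).
    """
    c = 4.0 * max_abs_adv * gamma / (1.0 - gamma) ** 2
    return surrogate_advantage - c * max_kl

# With gamma = 0.99 the coefficient C is roughly 4e4 * eps, so the penalty is
# only small when the new policy stays very close to the old one in KL.
print(improvement_lower_bound(surrogate_advantage=0.05, max_kl=1e-6))
```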

Step 3: Extend to general stochastic policies

Optimizing the surrogate objective in Step 2 already gives a policy iteration algorithm (the conservative policy iteration update): new_policy = (1 - a) * old_policy + a * optimal_policy_for_surrogate_objective. By choosing a small enough mixing coefficient a, the policy improvement is guaranteed to be non-negative, i.e. the expected return of the new policy is at least that of the old one.
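For concreteness, a sketch of how such a mixture policy would act at a single state, with made-up action distributions for the old policy and the surrogate maximizer:

```python
import numpy as np

def sample_from_mixture(pi_old, pi_surrogate_opt, alpha, rng):
    """Act with the mixture (1 - alpha) * pi_old + alpha * pi_surrogate_opt at one
    state; both arguments are action distributions (toy values below)."""
    mix = (1.0 - alpha) * np.asarray(pi_old) + alpha * np.asarray(pi_surrogate_opt)
    return rng.choice(len(mix), p=mix)

rng = np.random.default_rng(0)
pi_old = np.array([0.7, 0.2, 0.1])    # current policy's action probabilities
pi_star = np.array([0.1, 0.1, 0.8])   # maximizer of the surrogate objective
print(sample_from_mixture(pi_old, pi_star, alpha=0.2, rng=rng))
```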

But iterating over mixture policies is too restrictive for practical use. The paper extends the lower bound on policy improvement so that it holds for an arbitrary new stochastic policy, with the penalty expressed in terms of the divergence between the old and new policies, rather than only for a mixture. This yields a policy iteration algorithm that updates the policy in the full space of stochastic policies instead of the restricted mixture-policy space.
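For intuition, here is a simplified, self-contained sketch of a KL-regularized surrogate update on toy data. The full algorithm enforces a hard KL constraint (solved with conjugate gradient and a line search); a fixed KL penalty beta stands in for that machinery here, and all tensors are made up:

```python
import torch

torch.manual_seed(0)
n_samples, n_actions = 64, 4

logits_old = torch.randn(n_samples, n_actions)         # frozen old policy
logits_new = logits_old.clone().requires_grad_(True)   # policy being updated
actions = torch.randint(n_actions, (n_samples,))       # actions sampled with pi_old
advantages = torch.randn(n_samples)                    # A_{pi_old}(s, a) estimates
beta = 1.0                                             # KL penalty coefficient

optimizer = torch.optim.SGD([logits_new], lr=0.1)
for _ in range(20):
    logp_old = torch.log_softmax(logits_old, dim=-1)
    logp_new = torch.log_softmax(logits_new, dim=-1)
    # Importance ratio pi_new(a|s) / pi_old(a|s) for the sampled actions.
    ratio = (logp_new - logp_old).gather(1, actions[:, None]).squeeze(1).exp()
    surrogate = (ratio * advantages).mean()
    # Mean KL(pi_old || pi_new) over the sampled states.
    kl = (logp_old.exp() * (logp_old - logp_new)).sum(dim=-1).mean()
    loss = -(surrogate - beta * kl)                    # maximize penalized surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"surrogate={surrogate.item():.3f}  kl={kl.item():.4f}")
```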
