[RL] VPG

Vanilla Policy Gradient

  1. Exchange the derivative and integral operators (valid for a well-defined density).
  2. Log-derivative trick: rewrite ∇P(x|θ) as P(x|θ)∇log P(x|θ). Keeping the probability density inside the integral lets us convert the expression back to expectation form, which can then be approximated by sampling in practice.
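The two steps above can be checked numerically. The sketch below is a minimal illustration (not from the source): it estimates ∇_θ E[f(x)] for x ~ N(θ, σ²) with the toy choice f(x) = x², using the score-function estimator f(x)·∇_θ log P(x|θ), and compares it against the closed-form gradient 2θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 1.5, 1.0       # hypothetical Gaussian parameters for the demo
N = 200_000                   # number of Monte Carlo samples
x = rng.normal(theta, sigma, N)

# Score function: ∇_θ log N(x | θ, σ²) = (x − θ) / σ²
score = (x - theta) / sigma**2

# Sample-based estimate of ∇_θ E[x²] via the log-derivative trick
grad_est = np.mean(x**2 * score)

# Closed form: E[x²] = θ² + σ², so ∇_θ E[x²] = 2θ
grad_true = 2 * theta
```

With enough samples, `grad_est` matches `grad_true` closely, which is exactly why the expectation form is useful: it can be approximated from samples drawn from P(x|θ) alone.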

Generalized Advantage Estimate

  1. Suppose x is a random variable whose probability density is parameterized by θ: P(x|θ).
  2. Then P(x|θ), log P(x|θ), and ∇log P(x|θ) are all random variables.
  3. The expectation of ∇log P(x|θ) is zero: E[∇log P(x|θ)] = ∫P(x|θ)∇log P(x|θ) dx = ∫∇P(x|θ) dx = ∇∫P(x|θ) dx = ∇1 = 0.
  4. For any value ɸ independent of the random variable x, ɸE[∇log P(x|θ)] = E[ɸ∇log P(x|θ)] = 0, so such a term can be subtracted as a baseline without biasing the gradient.
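Points 3 and 4 can be verified empirically. This sketch (my own, with an arbitrary Gaussian and an arbitrary constant ɸ) draws samples and checks that the sample mean of the score, with or without the constant factor, is close to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 0.7, 1.0             # hypothetical Gaussian parameters
x = rng.normal(theta, sigma, 500_000)

score = (x - theta) / sigma**2      # ∇_θ log N(x | θ, σ²)
phi = 5.0                           # any constant independent of x

mean_score = score.mean()           # point 3: E[∇log P(x|θ)] = 0
mean_phi_score = (phi * score).mean()  # point 4: E[ɸ ∇log P(x|θ)] = 0
```

Both sample means shrink toward zero as the sample count grows, which is what licenses subtracting a baseline from the return in the policy-gradient estimator.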

Algorithm

Source: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#
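To tie the pieces together, here is a minimal sketch of a policy-gradient update loop. It is not the Spinning Up implementation; it is a toy version on a hypothetical 3-armed bandit with a softmax policy, written so the log-derivative trick and the baseline (justified by the zero-expectation lemma above) are both visible:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0, 3.0])  # hypothetical per-arm reward means
theta = np.zeros(3)                     # policy logits (parameters)
lr, batch, iters = 0.1, 64, 300

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

for _ in range(iters):
    pi = softmax(theta)
    a = rng.choice(3, size=batch, p=pi)            # sample actions from π_θ
    r = true_means[a] + rng.normal(0, 0.1, batch)  # noisy rewards
    baseline = r.mean()   # constant baseline; unbiased by the lemma above
    # For a softmax policy: ∇_θ log π(a) = onehot(a) − π
    grad_logp = np.eye(3)[a] - pi
    grad = ((r - baseline)[:, None] * grad_logp).mean(axis=0)
    theta += lr * grad    # gradient ascent on expected reward

pi = softmax(theta)       # final policy; should favor the best arm
```

After training, the policy concentrates on the highest-reward arm. A full VPG implementation replaces the bandit with trajectories from an environment and the raw return with an advantage estimate, but the update rule has the same shape.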
