Consider the following policy-search algorithm for a multi-armed binary bandit:
$$\forall a,\qquad \pi_{t+1}(a) = \pi_t(a)\,(1-\alpha) + \alpha\left(\mathbb{1}_{a=a_t}\,R_t + \left(1-\mathbb{1}_{a=a_t}\right)(1-R_t)\right)$$
where $\mathbb{1}_{a=a_t}$ is 1 if $a = a_t$ and 0 otherwise. Which of the following is true for the above algorithm?
1. It is the $L_{R-I}$ algorithm
2. It is the $L_{R-\epsilon P}$ algorithm
3. It would work well if the best arm had a probability of 0.9 of resulting in +1 reward and the next-best arm had a probability of 0.5 of resulting in +1 reward
4. It would work well if the best arm had a probability of 0.3 of resulting in +1 reward and the worst arm had a probability of 0.25 of resulting in +1 reward
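For concreteness, the update above can be written as a short Python sketch for a two-armed binary bandit (the function name policy_update, the use of NumPy, and the two-arm setting are illustrative assumptions, not part of the question):

```python
import numpy as np

def policy_update(pi, a_t, r_t, alpha):
    """One step of the update above for a 2-armed binary bandit.

    A reward (r_t = 1) moves probability toward the chosen arm a_t;
    a penalty (r_t = 0) moves it toward the other arm, so the policy
    is adjusted on every step regardless of the outcome.
    """
    indicator = np.zeros_like(pi)
    indicator[a_t] = 1.0
    return (1 - alpha) * pi + alpha * (indicator * r_t + (1 - indicator) * (1 - r_t))
```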
Answers
Answered by
Answer:
Don't know.
Answered by
Answer:
3
Explanation:
The update adjusts the policy on both reward and penalty, so it is the $L_{R-P}$ (linear reward-penalty) algorithm rather than $L_{R-I}$ or $L_{R-\epsilon P}$. It would work well if the best arm had a probability of 0.9 of resulting in +1 reward and the next-best arm had a probability of 0.5: whenever the reward is 0 the update pushes probability toward the other arm, so with success probabilities of 0.3 and 0.25 the frequent failures keep the policy close to uniform, whereas with 0.9 and 0.5 it settles mostly on the best arm.
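A quick simulation sketch supports this (the step size α = 0.1, the 5000 steps, and the random seed are assumed values, not from the question): with reward probabilities 0.9 and 0.5 the policy ends up pulling the best arm most of the time, while with 0.3 and 0.25 it stays near a 50/50 split.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(p_success, alpha=0.1, steps=5000):
    """Run the question's update on a 2-armed binary bandit.

    p_success[0] is the best arm's chance of a +1 reward.
    Returns the fraction of pulls that went to the best arm.
    """
    pi = np.array([0.5, 0.5])
    best_pulls = 0
    for _ in range(steps):
        a = rng.choice(2, p=pi)                  # sample an arm from the current policy
        best_pulls += (a == 0)
        r = 1.0 if rng.random() < p_success[a] else 0.0
        indicator = np.zeros(2)
        indicator[a] = 1.0
        pi = (1 - alpha) * pi + alpha * (indicator * r + (1 - indicator) * (1 - r))
        pi /= pi.sum()                           # guard against floating-point drift
    return best_pulls / steps

print(run([0.9, 0.5]))    # option 3's setting: best arm chosen most of the time
print(run([0.3, 0.25]))   # option 4's setting: barely better than random
```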