ONLINE LEARNING WITH GAUSSIAN PAYOFFS AND SIDE OBSERVATIONS

If you have a question about this talk, please contact Patrick Kelly.

In many learning problems the best results can only be achieved if the learning agent learns online, from experience gained while interacting with its environment. At the heart of designing such agents is the so-called exploration-exploitation dilemma: the agent must weigh the benefit of collecting reward against the benefit of collecting information. While in some simple settings, such as the so-called bandit problems, the last decades have brought a near-complete understanding of how best to resolve this dilemma, the question still eludes us in even slightly more complicated scenarios. In this talk, in addition to covering some of the core results known today, I will consider a variant of simple bandit problems: sequential learning with Gaussian payoffs and side observations, where the means are unknown and the variances are known. In this problem, an agent wants to collect as much reward as possible by choosing actions from a fixed finite set in a sequential fashion. Pulling an action gives the agent a random reward, which it observes together with side observations for all the other actions (those it did not choose). Each side observation has the same mean as the random reward the agent would have received had it chosen the corresponding action. By changing the relative magnitudes of the observation variances and their relation to the length of the learning horizon, this setting continuously interpolates between learning under full information, where no exploration is necessary, through hard partial-monitoring problems, where explicit exploration is necessary, to (uninteresting) impossible problems. In this talk, I will describe what we know and what we do not know about this learning setup. In particular, along the way we established, to my knowledge for the first time in the literature, tight, non-asymptotic, problem-dependent lower bounds on the regret of any algorithm; the proof technique may be of interest in its own right.
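
As a rough illustration of the observation model only (a minimal sketch, not the algorithm or the lower-bound construction from the talk), the following Python snippet simulates the setting: pulling an action yields a noisy observation of every action's mean with a known noise level, and a generic precision-weighted, UCB-style index is used to pick actions. The pair-dependent variance matrix, the index, and all names here are illustrative assumptions.

# Minimal simulation sketch of the setting described above (not the
# algorithm from the talk). Assumption: sigma[i][j] is the known std. dev.
# of the observation of action j's mean obtained when action i is pulled.
import numpy as np

rng = np.random.default_rng(0)
K, horizon = 5, 2000
mu = rng.normal(size=K)                    # unknown true means
sigma = 0.5 + rng.uniform(size=(K, K))     # known observation std. devs. (assumed structure)

# Precision-weighted running estimates of each mean (variances are known).
weighted_sum = np.zeros(K)
precision = np.zeros(K)
regret = 0.0

for t in range(1, horizon + 1):
    # Optimistic index: estimate plus a confidence width that shrinks with
    # accumulated precision (a generic UCB-style choice, not the paper's).
    est = np.where(precision > 0, weighted_sum / np.maximum(precision, 1e-12), 0.0)
    width = np.sqrt(2.0 * np.log(t + 1) / np.maximum(precision, 1e-12))
    arm = int(np.argmax(np.where(precision > 0, est + width, np.inf)))

    # Pulling `arm` yields one noisy observation of *every* action's mean.
    obs = mu + sigma[arm] * rng.standard_normal(K)
    weighted_sum += obs / sigma[arm] ** 2
    precision += 1.0 / sigma[arm] ** 2

    regret += mu.max() - mu[arm]           # pseudo-regret of this choice

print(f"pseudo-regret after {horizon} rounds: {regret:.2f}")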

Based on joint work with Yifan Wu and Andras Gyorgy.

About the presenter:

Csaba Szepesvari (PhD’99) is currently a Professor in the Department of Computing Science of the University of Alberta and a Principal Investigator of the Alberta Innovates Center for Machine Learning. Before that, he was a senior researcher at the Computer and Automation Research Institute of the Hungarian Academy of Sciences and held various industrial positions. The coauthor of a book on nonlinear approximate adaptive controllers and the author of a short book on Reinforcement Learning, he has published about 150 journal and conference papers. He is best known for the UCT algorithm, which led to a leap in the performance of planning and search algorithms in many domains, in particular in computer Go. He is an Action Editor of the Journal of Machine Learning Research and the Machine Learning Journal. His research interests include reinforcement learning, statistical learning theory and online learning.

This talk is part of the Featured talks series.
