Data Poisoning Attacks on Stochastic Bandits
arXiv:1905.06494v1 [cs.LG] 16 May 2019
Fang Liu 1, Ness Shroff 1,2
Abstract
Stochastic multi-armed bandits form a class of online learning problems that have important applications in online recommendation systems, adaptive medical treatment, and many others. Even though potential attacks against these learning algorithms may hijack their behavior, causing catastrophic loss in real-world applications, little is known about adversarial attacks on bandit algorithms. In this paper, we propose a framework of offline attacks on bandit algorithms and study convex-optimization-based attacks on several popular bandit algorithms. We show that the attacker can force the bandit algorithm to pull a target arm with high probability by a slight manipulation of the rewards in the data. Then we study a form of online attacks on bandit algorithms and propose an adaptive attack strategy against any bandit algorithm without knowledge of the bandit algorithm. Our adaptive attack strategy can hijack the behavior of the bandit algorithm, causing it to suffer a linear regret with only a logarithmic cost to the attacker. Our results demonstrate a significant security threat to stochastic bandits.
1. Introduction
Understanding adversarial attacks on machine learning systems is essential to developing effective defense mechanisms and an important step toward trustworthy artificial intelligence. A class of adversarial attacks on machine learning that has received much attention is data poisoning (Biggio et al., 2012; Mei & Zhu, 2015; Xiao et al., 2015; Alfeld et al., 2016; Li et al., 2016; Wang & Chaudhuri, 2018). Here, the attacker is able to access the learner's training data, and has the power to manipulate a fraction of the training data in order to make the learner satisfy certain objectives. This
1Department of Electrical and Computer Engineering, 2Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA. Correspondence to: Fang Liu, Ness Shroff .
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
is motivated by modern industrial scale applications of machine learning systems, where data collection and policy updates are done in a distributed way. Attacks can happen when the learner is not aware of the attacker’s access to the training data.
While there has been a line of research on adversarial attacks on deep learning (Goodfellow et al., 2015; Huang et al., 2017; Lin et al., 2017) and supervised learning (Biggio et al., 2012; Mei & Zhu, 2015; Xiao et al., 2015; Alfeld et al., 2016; Li et al., 2016), little is known about adversarial attacks on stochastic multi-armed bandits, a form of online learning with limited feedback. This is potentially hazardous since stochastic MABs are widely used in industry to recommend news articles (Li et al., 2010), display advertisements (Chapelle et al., 2015), and allocate medical treatments (Thompson, 1933), among many other applications. Hence, understanding the impact of adversarial attacks on bandit algorithms is an urgent yet open problem.
Recently, there has been an important study of offline data poisoning attacks on contextual bandit algorithms (Ma et al., 2018). The authors assume that the bandit algorithm updates periodically and that the attacker can manipulate the rewards in the data before the updates in order to hijack the behavior of the bandit algorithm. Consider news recommendation as a running example for this offline attack model. A news website has K articles (i.e., arms or actions) and runs a bandit algorithm to learn a recommendation policy. Every time a user visits the website, the bandit algorithm displays an article based on historical data. Then the website receives a binary reward indicating whether the user clicks on the displayed article or not. The website keeps serving users throughout the day and updates the bandit algorithm during the night. An attacker can perform offline data poisoning attacks before the bandit algorithm is updated. More specifically, the attacker may poison the rewards collected during the daytime and thereby control the behavior of the bandit algorithm as it wishes. The authors in (Ma et al., 2018) show that the offline attack strategy on LinUCB-type contextual bandit algorithms (Li et al., 2010) can be formulated as a convex optimization problem. However, offline attack strategies for classical bandit algorithms remain open.
In contrast to ofﬂine attacks, an online form of data poisoning attacks against bandit algorithms has also been proposed
by (Jun et al., 2018). They assume that the bandit algorithm updates step by step and the attacker sits between the bandit environment and the bandit algorithm. At each time step, the bandit algorithm pulls an arm and the attacker eavesdrops on the decision. Then the attacker makes an attack by manipulating the reward generated from the bandit environment. The bandit algorithm receives the poisoned reward without knowing the presence of the attacker and updates accordingly. The goal of the attacker is to control which arm appears to the bandit algorithm as the best arm in the end. Efficient attack strategies against the ε-greedy and Upper Confidence Bound (UCB) algorithms are proposed by (Jun et al., 2018). However, online attack strategies for other bandit algorithms (e.g., Thompson Sampling (Thompson, 1933) and UCBoost (Liu et al., 2018)) are still unknown.
In this work, we conduct a systematic investigation of data poisoning attacks against bandit algorithms and address the aforementioned open problems. We study data poisoning attacks in both the offline and the online cases. In the offline setting, the bandit algorithm updates periodically and the attacker can manipulate the rewards in the collected data before the update occurs. In the online setting, the attacker eavesdrops on the bandit algorithm and manipulates the feedback reward. The goal of the attacker is to make the bandit algorithm regard the target arm as the optimal arm in the end. Specifically, we make the following contributions to data poisoning attacks on stochastic bandits.
1. We propose an optimization-based framework for offline attacks on bandit algorithms. We then instantiate three offline attack strategies against the ε-greedy algorithm, the UCB algorithm, and Thompson Sampling, which are the solutions of the corresponding convex optimization problems. That is, there exist efficient attack strategies against these popular bandit algorithms.
2. We study online attacks on bandit algorithms and propose an adaptive attack strategy that can hijack any bandit algorithm without knowledge of the algorithm itself. To the best of our knowledge, this is the first negative result showing that no robust and uniformly good stochastic bandit algorithm can survive online poisoning attacks.
3. We evaluate our attack strategies with numerical results. Our attack strategies are efficient in forcing the bandit algorithms to pull a target arm at a relatively small cost. Our results expose a significant security threat, as bandit algorithms are widely employed in real-world applications.
More recently, a line of work (Lykouris et al., 2018; Gupta et al., 2019) has studied an adversarial corruption model. While they assume that the attacker has to attack all the arms before the learner chooses an arm, we consider the case where the attacker's strategy is aware of and adaptive to the decision of the learner. This difference leads to opposite results: while they propose a robust bandit algorithm, our work shows that the existing algorithms are vulnerable to adversarial attacks.
This paper is organized as follows. In Section 2, we introduce the problem setting and attack system models. Then, we formulate the offline attack problem in Section 3. In Section 4, we study the online attack problem and propose an adaptive attack strategy. Numerical results evaluating our attack strategies are provided in Section 5. Finally, we conclude the paper with a discussion in Section 6.
2. Problem Formulation
We consider the classical stochastic bandit setting. Suppose that there is a set A = {1, 2, . . . , K} of K arms and the bandit algorithm proceeds in discrete time t = 1, 2, . . . , T . At each round t, the algorithm pulls an arm at ∈ A and the bandit environment generates a reward rt ∈ R such that
$r_t = \mu_{a_t} + \eta_t$,  (1)
where $\eta_t$ is a zero-mean, $\sigma$-sub-Gaussian noise and $\mu_{a_t}$ is the instantaneous reward at time $t$. In other words, the expected reward of pulling arm $a$ is $\mu_a$. Note that $\{\mu_a\}_{a\in\mathcal{A}}$ are unknown to the bandit algorithm and the attacker.
The performance of the bandit algorithm is measured by the regret, which is the expected difference between the total reward obtained by an oracle that always pulls the optimal arm (i.e., the arm with the largest expected reward $\max_{a\in\mathcal{A}} \mu_a$) and the accumulated reward collected by the bandit algorithm up to the time horizon $T$. Formally, the regret $R(T)$ is given by

$R(T) = \mathbb{E}\left[ \max_{a\in\mathcal{A}} \mu_a T - \sum_{t=1}^{T} r_t \right]$.  (2)
In this work, we consider uniformly good bandit algorithms that incur sublinear regret, i.e., $R(T) = o(T)$.
For each reward $r_t$ returned from the bandit environment, the attacker manipulates the reward into

$\tilde{r}_t = r_t + \epsilon_t$.  (3)

Then the accumulated attack cost of the attacker, $C(T)$, is measured by a norm of the vector $(\epsilon_1, \ldots, \epsilon_T)^\top$. For simplicity, we consider the $\ell_2$-norm in the offline setting and the $\ell_1$-norm in the online setting. Note that the results in this work can be easily extended to any norm. For example, under the $\ell_p$-norm, the total attack cost of the attacker is

$C(T) = \left( \sum_{t=1}^{T} |\epsilon_t|^p \right)^{1/p}$.  (4)
[Figure 1: Offline attack system model. Components: bandit algorithm, attacker, data buffer of size T, bandit environment.]

[Figure 2: Online attack system model. Components: bandit algorithm, attacker, bandit environment.]
Without loss of generality, we assume that arm a∗ is a suboptimal attack target, such that µa∗ < maxa∈A µa. The attacker’s goal is to manipulate the bandit algorithm into pulling arm a∗ frequently. To avoid being detected, the attacker also wants to keep the cost as small as possible.
2.1. Ofﬂine attack system model
The offline attack system model is illustrated in Figure 1. Besides the bandit algorithm, the bandit environment, and the attacker, there is a data buffer of size $T$. This models the setting where updates of the bandit algorithm happen in mini-batches of size $T$. For round $t = 1, \ldots, T$, the bandit algorithm sequentially pulls arm $a_t$. Then the environment generates the reward $r_t$ according to Equation (1) and sends the tuple $(a_t, r_t)$ to the data buffer. The data buffer stores the data stream until round $T$. At the end of round $T$, the attacker accesses the data buffer and poisons the data by Equation (3). Finally, the data buffer sends the poisoned data stream $\{(a_t, \tilde{r}_t)\}_{t \le T}$ to the bandit algorithm, and the bandit algorithm updates according to the received data without knowing the existence of the attacker.
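The data flow above can be sketched in code as follows. This is a toy illustration, not part of the paper's model: the round-robin "bandit algorithm" and the Gaussian noise parameters are made-up stand-ins, and only Equations (1) and (3) are taken from the text.

```python
import random

def collect_buffer(pull_arm, true_means, T, sigma=0.1, seed=0):
    """Collect T rounds of (arm, reward) tuples, as the data buffer does."""
    rng = random.Random(seed)
    buffer = []
    for t in range(1, T + 1):
        a = pull_arm(buffer, t)                    # bandit algorithm's decision
        r = true_means[a] + rng.gauss(0.0, sigma)  # reward model of Equation (1)
        buffer.append((a, r))
    return buffer

def poison(buffer, epsilons):
    """Offline attack of Equation (3): add eps_t to each stored reward."""
    return [(a, r + e) for (a, r), e in zip(buffer, epsilons)]

# Toy example: a round-robin "bandit algorithm" over K = 3 arms, no attack yet.
buf = collect_buffer(lambda b, t: (t - 1) % 3, [0.5, 0.8, 0.3], T=9)
poisoned = poison(buf, [0.0] * 9)   # eps_t = 0 for all t means no attack
```

The attacker's job, formalized in Section 3, is to choose the `epsilons` vector.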
The goal of the attacker in the offline setting is to force the bandit algorithm to pull the target arm $a^*$ with high probability at round $T+1$ (i.e., after updating with the poisoned data) while incurring only a small cost. That is, the attacker wants to make each poisoning effort $\epsilon_t$ as small as possible in order to remain stealthy.
2.2. Online attack system model
The online attack system model is illustrated in Figure 2. In the online setting, the bandit algorithm updates instantly at each time step. The attacker stealthily monitors the decision $a_t$ of the bandit algorithm at each time $t$ and poisons the reward signal returned from the bandit environment by Equation (3). The bandit algorithm then receives the manipulated reward signal $\tilde{r}_t$ and updates, unaware of the attacker.
The goal of the attacker in the online setting is to hijack the behavior of the bandit algorithm with high probability by manipulating the reward signal so that the bandit algorithm
pulls the target arm $a^*$ in $\Theta(T)$ time steps. In the meantime, the attacker wants to control its attack cost by poisoning as infrequently as possible in order to avoid being detected. Note that $\epsilon_t = 0$ is considered as no attack.
By the linearity of expectation and the fact that $\eta_t$ is zero-mean noise,

$R(T) = \sum_{a\in\mathcal{A}} \left( \max_{i\in\mathcal{A}} \mu_i - \mu_a \right) \mathbb{E}[N_a(T)]$,  (5)

where $N_a(t)$ is the number of pulls of arm $a$ up to time $t$. Thus, the attack goal means that the attacker wants the bandit algorithm to incur a linear regret while the attacker incurs only a sublinear attack cost.
3. Ofﬂine Attacks
In this section, we introduce the offline attack framework for stochastic bandits. The updates of the bandit algorithm happen in mini-batches of size $T$.¹ Between two consecutive updates, the bandit algorithm follows a fixed policy obtained from the last update. This allows the attacker to poison the historical data before the update and force the algorithm to pull a target arm $a^*$ with high probability.
Let $m_a$ be the number of times arm $a$ was pulled up to time $T$, i.e., $m_a := N_a(T)$. For each arm $a \in \mathcal{A}$, let $y_a \in \mathbb{R}^{m_a}$ be the corresponding reward vector returned from the bandit environment when arm $a$ was pulled, i.e., $y_a := (r_t : a_t = a)^\top$. Let $\epsilon_a \in \mathbb{R}^{m_a}$ be the poisoning attack strategy of the attacker, i.e., $\epsilon_a := (\epsilon_t : a_t = a)^\top$. The poisoned reward vector for arm $a$ after the attack becomes $y_a + \epsilon_a$. To avoid being detected, the attacker hopes to make the poisoning $\epsilon_a$ as small as possible. Without loss of generality, we consider the $\ell_2$-norm attack cost in the offline attacks. We have that

$C(T)^2 = \sum_{t=1}^{T} \epsilon_t^2 = \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2$.  (6)
Thus, the attacker's offline attack problem can be formulated as the following optimization problem $P$:

$P: \min_{\{\epsilon_a\}_{a\in\mathcal{A}}} \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2$  (7)

$\text{s.t.}\ \mathbb{P}\{a_{T+1} = a^*\} \ge 1 - \delta$,  (8)

¹ The batch size $T$ is a relatively large integer compared to $K$.
for some error tolerance $\delta > 0$. Note that we define the attack goal as forcing the bandit algorithm to pull arm $a^*$ at the next round with high probability. This is because some bandit algorithms, such as ε-greedy and Thompson Sampling, are randomized: it is not feasible to force a randomized algorithm to pull a target arm with probability 1, but it is possible to hijack its behavior with high probability.
Proposition 1. Given some error tolerance $\delta > 0$, if $\{\epsilon_a^*\}_{a\in\mathcal{A}}$ is the optimal solution of problem $P$, then it is the optimal offline attack strategy for the attacker.
The proof of Proposition 1 follows from Equation (6) and the definition of offline attacks. Note that the constraint of problem $P$ depends on the bandit algorithm. We now assume that the attacker knows the bandit algorithm and introduce algorithm-specific offline attack strategies derived from the optimization problem $P$ for three popular bandit algorithms: ε-greedy, UCB, and Thompson Sampling. For simplicity, we denote the post-attack empirical mean observed by the bandit algorithm at the end of round $t$ as

$\tilde{\mu}_a(t) := \frac{1}{N_a(t)} \sum_{\tau=1}^{t} \tilde{r}_\tau \mathbb{1}\{a_\tau = a\}$.  (9)
3.1. Offline attack on ε-greedy
We now derive the offline attack strategy for the ε-greedy algorithm, a randomized algorithm with some decreasing rate function $\alpha_t$; typically, $\alpha_t = \Theta(1/t)$. At each time $t$, the ε-greedy algorithm pulls an arm

$a_t = \begin{cases} \text{draw uniformly over } \mathcal{A}, & \text{w.p. } \alpha_t \\ \arg\max_{a\in\mathcal{A}} \tilde{\mu}_a(t-1), & \text{otherwise.} \end{cases}$  (10)
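As a minimal sketch of the selection rule (10) — with an assumed rate function $\alpha_t = 1/t$ chosen purely for illustration:

```python
import random

def epsilon_greedy_pull(emp_means, t, alpha, rng):
    """Selection rule (10): explore uniformly w.p. alpha(t), else pull the
    arm with the largest (post-attack) empirical mean."""
    K = len(emp_means)
    if rng.random() < alpha(t):
        return rng.randrange(K)                              # uniform exploration
    return max(range(K), key=lambda a: emp_means[a])         # exploitation

# With alpha_t = 1/t and t large, exploitation dominates.
arm = epsilon_greedy_pull([0.4, 0.9, 0.2], t=100,
                          alpha=lambda t: 1.0 / t, rng=random.Random(1))
```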
At time step $T+1$, the ε-greedy algorithm uniformly samples an arm from the arm set $\mathcal{A}$ with probability $\alpha_{T+1}$, which cannot be controlled by the attacker. We therefore set the error tolerance to $\delta = \frac{K-1}{K}\alpha_{T+1}$, since the target arm $a^*$ may also be pulled during the uniform sampling. Thus, the attacker poisons the rewards such that the target arm $a^*$ has the largest empirical mean. After the attack, the ε-greedy algorithm pulls an arm $a_{T+1}$ at time $T+1$ such that $\mathbb{P}\{a_{T+1} = a^*\} = 1 - \delta$. To make the target arm the unique arm with the largest empirical mean, we introduce a margin parameter $\xi > 0$. The attacker's optimization problem for attacking the ε-greedy algorithm is then the following problem $P_1$:
$P_1: \min_{\{\epsilon_a\}_{a\in\mathcal{A}}} \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2$  (11)

$\text{s.t.}\ \tilde{\mu}_{a^*}(T) \ge \tilde{\mu}_a(T) + \xi,\ \forall a \ne a^*$,  (12)

where $\tilde{\mu}_a(T) = (y_a + \epsilon_a)^\top \mathbb{1} / m_a$ and $\mathbb{1}$ is the all-ones vector. The constraint in the above optimization implies that the ε-greedy algorithm will pull the target arm $a^*$ at time step $T+1$ with probability at least $1 - \frac{K-1}{K}\alpha_{T+1}$. The following result shows that there exists at least one optimal solution of problem $P_1$, i.e., one optimal offline attack against the ε-greedy algorithm.
Theorem 1. Given any margin parameter $\xi > 0$ and any reward instance $\{y_a\}_{a\in\mathcal{A}}$, there exists at least one optimal solution of problem $P_1$, which is a quadratic program with linear constraints. Hence, there exists at least one optimal offline attack against the ε-greedy algorithm.
The proof of Theorem 1 is provided in Section A.1 in the supplementary material. Note that the error tolerance parameter $\delta = \frac{K-1}{K}\alpha_{T+1}$ depends on the rate function $\alpha_t$ of the ε-greedy algorithm, which is not controllable by the attacker: it accounts for the exploration introduced by the bandit algorithm's internal randomization, which cannot be manipulated. However, the attacker can wait for a large enough $T$, since the rate function $\alpha_t$ eventually goes to zero. Moreover, the attacker's strategy (problem $P_1$) does not depend on the rate function, i.e., the attacker does not require knowledge of the ε-greedy algorithm's parameter.
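As an illustration, the following sketch computes a feasible (generally suboptimal) point of $P_1$, under the assumption that the target arm is left untouched: each non-target arm's rewards are shifted uniformly until its mean drops to $\tilde{\mu}_{a^*}(T) - \xi$. For a fixed change in an arm's mean, a uniform shift minimizes the squared $\ell_2$ cost; a QP solver would recover the exact optimum of $P_1$.

```python
def attack_epsilon_greedy(rewards_by_arm, target, xi):
    """A feasible (not necessarily optimal) point for problem P1: leave the
    target arm untouched and shift every other arm's rewards uniformly so
    its empirical mean drops to mu~_target - xi."""
    mu = {a: sum(y) / len(y) for a, y in rewards_by_arm.items()}
    eps = {}
    for a, y in rewards_by_arm.items():
        shift = 0.0
        if a != target and mu[a] > mu[target] - xi:
            shift = (mu[target] - xi) - mu[a]       # negative: lower the mean
        eps[a] = [shift] * len(y)
    cost_sq = sum(s * s for v in eps.values() for s in v)   # Equation (6)
    return eps, cost_sq

eps, cost_sq = attack_epsilon_greedy({0: [1.0, 0.8], 1: [0.4, 0.6], 2: [0.2]},
                                     target=2, xi=0.1)
```

After the attack, constraint (12) holds with equality for every arm whose mean was lowered.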
3.2. Ofﬂine attack on UCB
We now derive the offline attack strategy for the classical UCB algorithm. At each time $t$, the UCB algorithm pulls the arm with the highest UCB index:

$a_t = \arg\max_{a\in\mathcal{A}} u_a(t) := \tilde{\mu}_a(t-1) + 3\sigma \sqrt{\frac{\log t}{N_a(t-1)}}$.  (13)
The UCB algorithm pulls the target arm $a^*$ at time $T+1$ if and only if the UCB index of arm $a^*$ is the unique largest one. Thus, the attacker can manipulate the rewards to satisfy this condition. Given any margin parameter $\xi > 0$, the attacker's optimization problem becomes
$P_2: \min_{\{\epsilon_a\}_{a\in\mathcal{A}}} \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2$

$\text{s.t.}\ u_{a^*}(T+1) \ge u_a(T+1) + \xi,\ \forall a \ne a^*$.  (14)
The condition in the above optimization implies that the UCB algorithm will pull the target arm after the poisoning attack. The following result shows that there exists at least one optimal solution of problem P2, i.e., one optimal ofﬂine attack for the UCB algorithm.
Theorem 2. Given any margin parameter $\xi > 0$ and any reward instance $\{y_a\}_{a\in\mathcal{A}}$, there exists at least one optimal solution of problem $P_2$, which is a quadratic program with linear constraints. Hence, there exists at least one optimal offline attack for the UCB algorithm.
The proof of Theorem 2 is similar to that of Theorem 1. Note that the above attack strategy holds for any error tolerance $\delta$, since the UCB algorithm is deterministic.
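Analogously to the ε-greedy case, a feasible (generally suboptimal) point of $P_2$ can be sketched by leaving the target arm alone and lowering the other arms' indices; this is an illustration under that assumption, not the optimizer of $P_2$.

```python
import math

def attack_ucb(rewards_by_arm, target, T, sigma, xi):
    """A feasible (not necessarily optimal) point for problem P2: uniformly
    lower each non-target arm's rewards until its UCB index (13) at round
    T+1 sits at least xi below the target's index."""
    def index(mean, m):
        return mean + 3 * sigma * math.sqrt(math.log(T + 1) / m)
    mu = {a: sum(y) / len(y) for a, y in rewards_by_arm.items()}
    u_star = index(mu[target], len(rewards_by_arm[target]))
    eps = {}
    for a, y in rewards_by_arm.items():
        shift = 0.0
        if a != target:
            gap = index(mu[a], len(y)) - (u_star - xi)
            if gap > 0:
                shift = -gap        # lowering the mean lowers the index 1:1
        eps[a] = [shift] * len(y)
    return eps

eps = attack_ucb({0: [1.0, 0.8, 0.9], 1: [0.5, 0.6], 2: [0.2, 0.3]},
                 target=2, T=7, sigma=0.1, xi=0.05)
```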
3.3. Empirical mean based bandit algorithms
One insight from our offline attacks on the ε-greedy and UCB algorithms is that empirical-mean-based algorithms are vulnerable to attack. This is because the empirical-mean-related constraints are linear and the feasible set is nonempty, so the optimization problem $P$ becomes a quadratic program with nonempty linear constraints, which can be solved efficiently. This result also holds for many other variants of the UCB algorithm and for Explore-Then-Commit algorithms (Garivier et al., 2016), which uniformly sample the arm set in the first (exploration) phase and commit to the arm with the largest empirical mean in the second (commitment) phase.
3.4. Beyond empirical means: Thompson Sampling for Gaussian distributions
We have shown that the empirical-mean-based algorithms are not secure against the offline attack. It would appear that Bayesian algorithms should be more robust to the offline attack, as the constraint of problem $P$ becomes intractable. Unfortunately, we find that Thompson Sampling, a popular Bayesian algorithm, is also vulnerable to data poisoning.
We now derive the attack strategy for Thompson Sampling for Gaussian distributions.² In other words, the noise $\eta_t$ is i.i.d. sampled from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. Thompson Sampling for Gaussian distributions with the Jeffreys prior (Korda et al., 2013) works as follows. At each time $t$, the algorithm samples $\theta_a(t)$ from the posterior distribution $\mathcal{N}(\tilde{\mu}_a(t-1)/\sigma^2, \sigma^2/N_a(t-1))$ for each arm $a$ independently. Then the algorithm pulls the arm with the highest sample value, i.e., $a_t = \arg\max_{a\in\mathcal{A}} \theta_a(t)$.
Let $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ be the probability density function (pdf) of the standard Gaussian distribution and $\Phi(x)$ the corresponding cumulative distribution function (cdf). For simplicity, we denote by $\phi_a(x)$ the pdf of the Gaussian distribution $\mathcal{N}(\tilde{\mu}_a(T)/\sigma^2, \sigma^2/N_a(T))$ for each arm $a$ and by $\Phi_a(x)$ the corresponding cdf. Since we are interested in the samples at time $T+1$, we omit the time index in the
² Thompson Sampling needs distribution models, and the Gaussian distribution is popular and well-studied in the literature. The idea of problem relaxation here can be extended to other distributions.
following discussion. By the law of total expectation, we have that

$\mathbb{P}\{a_{T+1} = a^*\} = \mathbb{P}\{\cap_{a\ne a^*}\, \theta_{a^*} > \theta_a\}$  (15)

$= \int_{-\infty}^{+\infty} \mathbb{P}\{\cap_{a\ne a^*}\, \theta_a < x \mid \theta_{a^*} = x\}\, \phi_{a^*}(x)\, dx$

$= \int_{-\infty}^{+\infty} \left( \prod_{a\ne a^*} \mathbb{P}\{\theta_a < x\} \right) \phi_{a^*}(x)\, dx$

$= \int_{-\infty}^{+\infty} \left( \prod_{a\ne a^*} \Phi_a(x) \right) \phi_{a^*}(x)\, dx$.  (16)
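The two sides of (15)–(16) can be compared numerically. This sanity check uses toy posterior means and standard deviations (made up for illustration), a midpoint Riemann sum for the integral in (16), and direct sampling for the probability in (15):

```python
import math, random

def phi(x):   # standard Gaussian pdf
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard Gaussian cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pull_prob_integral(post, target, grid=4000, lo=-10.0, hi=10.0):
    """Right-hand side of (16) by a midpoint Riemann sum; post[a] = (mean, std)
    of arm a's posterior."""
    h = (hi - lo) / grid
    m_t, s_t = post[target]
    total = 0.0
    for i in range(grid):
        x = lo + (i + 0.5) * h
        prod = 1.0
        for a, (m, s) in post.items():
            if a != target:
                prod *= Phi((x - m) / s)                 # Phi_a(x)
        total += prod * phi((x - m_t) / s_t) / s_t * h   # ... phi_{a*}(x) dx
    return total

def pull_prob_sampling(post, target, n=200_000, seed=0):
    """Left-hand side of (15) by sampling the posteriors directly."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        draws = {a: rng.gauss(m, s) for a, (m, s) in post.items()}
        wins += max(draws, key=draws.get) == target
    return wins / n

post = {0: (0.9, 0.2), 1: (0.5, 0.3), 2: (0.7, 0.25)}
p_int = pull_prob_integral(post, target=0)
p_mc = pull_prob_sampling(post, target=0)
```

The two estimates agree up to Monte Carlo error, which is what makes the exact form (16) hard to use inside an optimization problem and motivates the relaxation below.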
Equation (16) is complicated to compute and analyze, which makes the instantiation of the offline attack problem $P$ nontrivial. To address this challenge, we derive a sufficient condition for the original constraint and thereby a relaxation of the original problem $P$. By the union bound, we have that
$\mathbb{P}\{a_{T+1} \ne a^*\} = \mathbb{P}\{\cup_{a\ne a^*}\, \theta_{a^*} < \theta_a\} \le \sum_{a\ne a^*} \mathbb{P}\{\theta_{a^*} < \theta_a\}$

$= \sum_{a\ne a^*} \Phi\left( \frac{\tilde{\mu}_a(T) - \tilde{\mu}_{a^*}(T)}{\sigma^3 \sqrt{1/m_a + 1/m_{a^*}}} \right)$.  (17)
Thus, we are ready to introduce the attacker's problem for Thompson Sampling for Gaussian distributions:

$P_3: \min_{\{\epsilon_a\}_{a\in\mathcal{A}}} \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2$  (18)

$\text{s.t.}\ \sum_{a\ne a^*} \Phi\left( \frac{\tilde{\mu}_a(T) - \tilde{\mu}_{a^*}(T)}{\sigma^3 \sqrt{1/m_a + 1/m_{a^*}}} \right) \le \delta$  (19)

$\tilde{\mu}_a(T) - \tilde{\mu}_{a^*}(T) \le 0,\ \forall a \ne a^*$.  (20)
The constraint of problem $P_3$ is a sufficient condition for the constraint of problem $P$ by Equation (17). Note that the linear constraints (20) are redundant, since $\Phi(0) = 0.5$ and $\delta < 1/2$ is usually satisfied. The following proposition shows that the constraint set of problem $P_3$ is convex.
Proposition 2. The constraint set formed by Equations (19) and (20) is convex.
The proof of Proposition 2 is provided in Section A.3 in the supplementary material. The following result shows that there exists at least one optimal solution of problem P3 that is a feasible ofﬂine attack for the Thompson Sampling for Gaussian distributions.
Theorem 3. Given any error tolerance $\delta > 0$ and any reward instance $\{y_a\}_{a\in\mathcal{A}}$, there exists at least one optimal solution of problem $P_3$, which is a quadratic program with convex constraints. Hence, there exists at least one feasible offline attack for the Thompson Sampling algorithm for Gaussian distributions.
By Proposition 2, the proof of Theorem 3 is similar to that of Theorem 1. Note that the above attack strategy is not the optimal attack strategy formulated by $P$. However, it is easy to compute, since problem $P_3$ is a quadratic program with convex constraints. Another relaxation of problem $P$ is presented in Section A.4 in the supplementary material.
4. Online Attacks
In this section, we study online attacks against stochastic bandits. In the online attack setting, the bandit algorithm updates its policy at each round. The attacker eavesdrops on the decision $a_t$ of the bandit algorithm and poisons the reward by adding an arbitrary $\epsilon_t \in \mathbb{R}$, so the reward observed by the bandit algorithm is $\tilde{r}_t = r_t + \epsilon_t$. Without loss of generality, we consider the $\ell_1$-norm attack cost; that is, the cost of the attacker at round $t$ is $|\epsilon_t|$. Recall that $N_a(t)$ is the number of pulls of arm $a$ up to time $t$.
Without attacks, the bandit algorithm is a uniformly good policy that achieves $O(\log T)$ regret,³ i.e., $\mathbb{E}[\sum_a (\max_i \mu_i - \mu_a) N_a(T)] = O(\log T)$ for any problem instance $\{\mu_a\}_{a\in\mathcal{A}}$. Moreover, the expected number of pulls of the optimal arm (the one with the highest expected reward) is $T - o(T)$.
The goal of the attacker is to force the bandit algorithm to pull the suboptimal target arm $a^*$ as often as possible while paying the least possible total cost. Formally, the attacker wants the bandit algorithm to incur linear expected regret, i.e., $\mathbb{E}[N_{a^*}(T)] = T - o(T)$, while paying an expected total cost $\mathbb{E}[\sum_{t=1}^{T} |\epsilon_t|] = O(\log T)$. In other words, the attacker wants to manipulate the rewards so that the bandit algorithm considers the target arm the best arm in the long run.
4.1. Oracle constant attacks
The fact that the attacker does not know the expected rewards {µa}a∈A is challenging because otherwise the attacker can perform the attack trivially. Suppose the attacker knows the expected rewards, then the attacker can choose the following oracle attack strategy,
ε_t = −1{a_t ≠ a∗}[µ_{a_t} − µ_{a∗} + ξ]_+, (21)
for some margin parameter ξ > 0. Note that [·]+ indicates
the function such that [x]+ = max{x, 0} and 1{·} is the
indicator function. By this attack, the bandit algorithm sees a poisoned bandit problem, where the target arm a∗ is the
optimal arm and all the other arms are at least ξ below
the target arm. Thus, the bandit algorithm pulls the target
arm with E[N_{a∗}(T)] = T − o(T) and pays a total cost E[∑_{t=1}^{T} |ε_t|] = O(log T), since the ε_t are bounded. This result has been shown in (Jun et al., 2018) as follows.
³The results in this section directly apply to problem-independent regret bounds and high-probability bounds.
Proposition 3. (Proposition 1 in (Jun et al., 2018)) Assume that the bandit algorithm achieves an O(log T) regret bound. Then the oracle attack with ξ > 0 succeeds, i.e., E[N_{a∗}(T)] = T − o(T). Furthermore, the expected attack cost is O(∑_{i≠a∗} [µ_i − µ_{a∗} + ξ]_+ · log T).
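A minimal sketch of the oracle attack rule in (21), assuming the attacker knows the true means (the function name is hypothetical):

```python
def oracle_attack(a_t, mu, target, xi=0.001):
    """Oracle attack of eq. (21): when a non-target arm is pulled, lower its
    observed reward so the arm appears at least xi below the target's mean;
    when the target arm is pulled, do nothing."""
    if a_t == target:
        return 0.0
    return -max(mu[a_t] - mu[target] + xi, 0.0)
```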
The insight from Proposition 3 is that the attacker does not need to attack in round t if a_t = a∗. This helps us design attack strategies that mimic the oracle attack. When the ground truth {µ_a}_{a∈A} is not known to the attacker, the attacker may guess upper bounds {C_a}_{a≠a∗} on {[µ_a − µ_{a∗}]_+}_{a≠a∗} and perform the following oracle constant attack,
ε_t = −1{a_t ≠ a∗} C_{a_t}. (22)
The following result shows the sufﬁcient and necessary conditions for the oracle constant attack to be successful.
Proposition 4. Assume that the bandit algorithm achieves an O(log T) regret bound. Then the constant attack with {C_a}_{a≠a∗} succeeds if and only if C_a > [µ_a − µ_{a∗}]_+ for all a ≠ a∗. If the attack succeeds, then the expected attack cost is O(∑_{a≠a∗} C_a log T).
The proof of Proposition 4 is provided in Section B.1 of the supplementary material. Proposition 4 shows that the attacker has to know the unknown bounds on {[µ_a − µ_{a∗}]_+}_{a≠a∗} to guarantee a successful constant attack. Moreover, the oracle constant attack is non-adaptive to the problem instance: an upper bound C_a can be much larger than the quantity [µ_a − µ_{a∗}]_+, so the attacker pays unnecessary attack cost compared to the oracle attack. To address this challenge, we propose an adaptive constant attack that learns the bandit environment in an online fashion and performs the attack adaptively.
4.2. Adaptive attack by constant estimation
Now, we are ready to propose the adaptive attack strategy.
The idea is that the attacker can maintain upper bounds on the unknown quantities {[µ_a − µ_{a∗}]_+}_{a≠a∗} based on the empirical means it observes, and then perform the constant attack with these estimated upper bounds. Note that the attacker observes the tuple (a_t, r_t) at each time t and is thus able to form an unbiased empirical mean. Let µ̂_a(t) be the pre-attack empirical mean observed by the attacker at time t, that is,
µ̂_a(t) := (1 / N_a(t)) ∑_{τ=1}^{t} r_τ 1{a_τ = a}. (23)

Given any δ ∈ (0, 1), we define the function β(n) as

β(n) = sqrt( (2σ² / n) · log( π² K n² / (3δ) ) ). (24)
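The confidence radius in (24) is straightforward to compute. A sketch, with the experiment's values K = 5, σ = 0.1, and δ = 0.05 assumed as defaults:

```python
import math

def beta(n, K=5, sigma=0.1, delta=0.05):
    """Confidence radius of eq. (24):
    beta(n) = sqrt((2*sigma**2 / n) * log(pi**2 * K * n**2 / (3*delta))).
    Shrinks roughly like sqrt(log(n)/n) as the pull count n grows."""
    return math.sqrt((2 * sigma**2 / n)
                     * math.log(math.pi**2 * K * n**2 / (3 * delta)))
```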
Then, the following result shows that {β(N_a(t))}_{a∈A} gives a high-probability bound on the pre-attack empirical means of all arms in all rounds. Define the event

E = { ∀a ∈ A, ∀t : |µ̂_a(t) − µ_a| < β(N_a(t)) }. (25)
Lemma 1. (Lemma 1 in (Jun et al., 2018)) For δ ∈ (0, 1), P(E) > 1 − δ.
The proof of Lemma 1 follows from Hoeffding's inequality and a union bound. Lemma 1 shows that with probability at least 1 − δ, the expected reward µ_a lies within the interval (µ̂_a(t) − β(N_a(t)), µ̂_a(t) + β(N_a(t))) centered at the pre-attack empirical mean µ̂_a(t). Thus, we obtain a high-probability upper bound on the quantity µ_a − µ_{a∗} and propose the following Adaptive attack by Constant Estimation (ACE),
ε_t = −1{a_t ≠ a∗}[ µ̂_{a_t}(t) − µ̂_{a∗}(t) + β(N_{a_t}(t)) + β(N_{a∗}(t)) ]_+.
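The ACE rule can be sketched as follows; the helper and its arguments are hypothetical names for the attacker's running statistics (`mu_hat` for pre-attack empirical means, `N` for pull counts, `beta_fn` for the radius of (24)):

```python
def ace_attack(a_t, target, mu_hat, N, beta_fn):
    """ACE: upper-bound the unknown gap mu_{a_t} - mu_{a*} by
    mu_hat[a_t] - mu_hat[target] + beta(N[a_t]) + beta(N[target]),
    then apply the oracle-style correction when a non-target arm is pulled."""
    if a_t == target:
        return 0.0
    gap_ub = mu_hat[a_t] - mu_hat[target] + beta_fn(N[a_t]) + beta_fn(N[target])
    return -max(gap_ub, 0.0)
```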
The following result shows that ACE succeeds against any bandit algorithm, without knowledge of which algorithm is being used.
Theorem 4. Given any δ ∈ (0, 0.5), assume that the bandit algorithm achieves an O(log T) regret bound with probability at least 1 − δ. Then, with probability at least 1 − 2δ, the ACE attacker forces the bandit algorithm to play the target arm a∗ for N_{a∗}(T) = T − o(T) rounds, using an instantaneous attack cost
|ε_t| ≤ 1{a_t ≠ a∗}( [µ_{a_t} − µ_{a∗}]_+ + 2β(N_{a_t}(t)) + 2β(N_{a∗}(t)) ).
In particular, with probability at least 1 − 2δ, the ACE attack is successful and the accumulated attack cost is

∑_{t=1}^{T} |ε_t| ≤ O( ∑_{a≠a∗} ( [µ_a − µ_{a∗}]_+ + 4β(1) ) log T ).

The proof of Theorem 4 is provided in Section B.2 of the supplementary material. Theorem 4 shows that ACE is universally successful against any bandit algorithm, without any prior information on {µ_a}_{a∈A}. Besides, ACE incurs a high-probability accumulated attack cost as small as that of the oracle attack⁴, with only an additional bounded additive term, O(4β(1)K log T). That is, ACE is close to the hindsight-oracle attack strategy. Moreover, ACE requires no knowledge of the bandit algorithm. This is an advantage over the algorithm-dependent online attack strategies designed in (Jun et al., 2018), since the attacker may not know which bandit algorithm the learner uses in practice. As far as we know, this is the first negative result showing that there is no robust bandit algorithm that is immune to adaptive online attacks. This exposes a significant security threat to bandit learning systems.

5. Numerical Results

In this section, we run simulations on attacking the ε-greedy, UCB, and Thompson Sampling algorithms to illustrate our theoretical results. All the simulations are run in MATLAB and the code can be found in the supplementary materials. Note that we implement ε-greedy with α_t = 1/t.

5.1. Offline attacks

To study the effectiveness of the offline attacks, we consider the following experiment. The bandit has K = 5 arms and the reward noise is Gaussian, N(0, σ²) with σ = 0.1. The attacker's target arm is arm K with µ_K = 0, while the expected rewards of the other arms are uniformly distributed in the unit interval [0, 1]. We set T = 1000 and the error tolerance to δ = 0.05.

In each attack trial, we first generate the bandit instance by drawing the expected rewards from the uniform distribution on [0, 1]. Then we run the bandit algorithm for T rounds and collect all the historical data. Without any attack, the bandit algorithm would have converged to some optimal arm in the trial, which is not the target arm since the target arm has the smallest payoff. Then we set the margin parameter to ξ = 0.001 and run the corresponding offline attacks. The attack strategy is the solution of the optimization problem P_i. Since all the problems are quadratic programs with linear (convex) constraints, they can be solved by standard optimization tools.

We run 1000 attack trials. Note that the attack cost depends on the bandit instance. To evaluate the attack cost, we use the poisoning effort ratio (Ma et al., 2018):

∥ε∥²₂ / ∥y∥²₂ = ( ∑_{a∈A} ∥ε_a∥²₂ ) / ( ∑_{a∈A} ∥y_a∥²₂ ). (26)

The poisoning effort ratio measures the ratio of the total cost to the norm of the original rewards. To evaluate the attack effectiveness, we check whether the poisoned data satisfy the constraints of the optimization problem P. If so, the bandit algorithm will play the target arm with probability at least 1 − δ.
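The ratio in (26) reduces to a squared-norm quotient over the concatenated perturbation and reward vectors; a sketch (the function name is hypothetical):

```python
import numpy as np

def poisoning_effort_ratio(eps, y):
    """Poisoning effort ratio of eq. (26): squared l2 norm of the
    perturbations eps over the squared l2 norm of the original rewards y."""
    eps, y = np.asarray(eps, float), np.asarray(y, float)
    return float(np.sum(eps**2) / np.sum(y**2))
```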
Figure 3 shows the results of the attacks against ε-greedy, UCB, and Thompson Sampling. First, the attack strategies are effective, as all the attacks are successful. Second, the attack costs are small, as shown in the histograms. The poisoning effort ratios for attacking ε-greedy, UCB, and Thompson Sampling are less than 10%, 2%, and 5%, respectively.
Figure 3: Histograms of poisoning effort ratio in the offline attacks: (a) attack on ε-greedy, (b) attack on UCB, (c) attack on Thompson Sampling. The success rate is 100% in each panel.
Figure 4: Attack cost and target arm pulls in the online attacks: (a) attack on ε-greedy, (b) attack on UCB, (c) attack on Thompson Sampling. Each panel plots the attack cost of ACE (and of Jun's attack, where applicable) over time for ∆ ∈ {0.2, 0.5, 1}, together with the number of target arm pulls with and without the attack.
5.2. Online attacks
We compare our adaptive attack strategy with the two attack strategies proposed by (Jun et al., 2018). Note that these two attacks target the ε-greedy and UCB algorithms and require knowledge of the algorithm. In contrast, our attack strategy ACE is a universal attack strategy against any bandit algorithm.
We consider the following experiment. The bandit has two arms. The expected rewards of arms 1 and 2 are µ₁ = ∆ and µ₂ = 0 with ∆ > 0. The attacker's target is arm 2. The reward noise is i.i.d. Gaussian, N(0, σ²) with σ = 0.1. We set the error tolerance to δ = 0.05 and the time horizon to T = 10⁵ rounds. For the implementations of the attack strategies proposed by (Jun et al., 2018), we choose the tuning parameter ∆₀ = σ, as suggested by (Jun et al., 2018) when T is not known to the attacker. We run 100 attack trials with different ∆ values.

⁴High-probability bounds can be adapted from Proposition 3.
Figure 4 shows the average attack cost and the number of target arm pulls in the online attacks. Note that the target arm pulls are shown for the case ∆ = 1. First, we compare the number of target arm pulls with and without the ACE attack. The ACE attack forces the bandit algorithm to pull the target arm linearly in time. Second, the attack costs are relatively small compared to the loss of the bandit algorithm, which is linear in time. In general, the attack costs of ACE are bounded by O(log T) and increase as the reward gap ∆ becomes larger. These observations verify the result of Theorem 4. On the other hand, the attack costs on Thompson Sampling and ε-greedy are relatively smaller than that on UCB. This is because Thompson Sampling and ε-greedy converge to the "optimal" arm very quickly, while UCB's exploration of "non-optimal" arms may still increase over time. Finally, compared
to the algorithm-specific attacks proposed by (Jun et al., 2018), the attack cost of ACE on ε-greedy is slightly worse, while the attack cost of ACE on UCB is much larger than that of Jun's attack. In the case of attacking the UCB algorithm, our universal attack strategy incurs more cost than the algorithm-specific attack, but without needing to know the algorithm.
6. Conclusion
In this work, we study the open problem of data poisoning attacks on bandit algorithms. We propose an offline attack framework for stochastic bandits and instantiate three algorithm-specific offline attack strategies against ε-greedy, UCB, and Thompson Sampling. Then, we study online attacks against bandit algorithms and propose an adaptive attack strategy that can hijack the behavior of any bandit algorithm without requiring knowledge of the bandit algorithm. Our theoretical results and simulations show that bandit algorithms are vulnerable to poisoning attacks in both the online and offline settings.
Acknowledgements
This work has been supported in part by a grant from DTRA (HDTRA11410058), a grant from the Office of Naval Research (N000141712417), grants from the National Science Foundation (CNS1446582, CNS1518829, CNS1409336), and a grant from the Army Research Office (WN11NF1510277). This work has also been supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT), (2017000692, Transport-aware Streaming Technique Enabling Ultra Low-Latency AR/VR Services).
Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. arXiv preprint arXiv:1902.08647, 2019.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
Jun, K.S., Li, L., Ma, Y., and Zhu, X. Adversarial attacks on stochastic bandits. arXiv preprint arXiv:1810.12188, 2018.
Korda, N., Kaufmann, E., and Munos, R. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pp. 1448–1456, 2013.
Li, B., Wang, Y., Singh, A., and Vorobeychik, Y. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pp. 1885–1893, 2016.
Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. ACM, 2010.
Lin, Y.C., Hong, Z.W., Liao, Y.H., Shih, M.L., Liu, M.Y., and Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
References
Alfeld, S., Zhu, X., and Barford, P. Data poisoning attacks against autoregressive models. In AAAI, pp. 1452–1458, 2016.
Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, pp. 1467–1474, 2012.
Chapelle, O., Manavoglu, E., and Rosales, R. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2015.
Garivier, A., Lattimore, T., and Kaufmann, E. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, pp. 784–792, 2016.
Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
Liu, F., Wang, S., Buccapatnam, S., and Shroff, N. UCBoost: a boosting approach to tame complexity and optimality for stochastic bandits. arXiv preprint arXiv:1804.05929, 2018.
Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114–122. ACM, 2018.
Ma, Y., Jun, K.S., Li, L., and Zhu, X. Data poisoning attacks in contextual bandits. arXiv preprint arXiv:1808.05760, 2018.
Mei, S. and Zhu, X. Using machine teaching to identify optimal trainingset attacks on machine learners. In AAAI, pp. 2871–2877, 2015.
Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Data Poisoning Attacks on Stochastic Bandits
Wang, Y. and Chaudhuri, K. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
Xiao, H., Biggio, B., Brown, G., Fumera, G., Eckert, C., and Roli, F. Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pp. 1689–1698, 2015.
arXiv:1905.06494v1 [cs.LG] 16 May 2019
Fang Liu 1 Ness Shroff 1 2
Abstract
Stochastic multi-armed bandits form a class of online learning problems that have important applications in online recommendation systems, adaptive medical treatment, and many others. Even though potential attacks against these learning algorithms may hijack their behavior, causing catastrophic loss in real-world applications, little is known about adversarial attacks on bandit algorithms. In this paper, we propose a framework of offline attacks on bandit algorithms and study convex-optimization-based attacks on several popular bandit algorithms. We show that the attacker can force the bandit algorithm to pull a target arm with high probability by a slight manipulation of the rewards in the data. Then we study a form of online attacks on bandit algorithms and propose an adaptive attack strategy against any bandit algorithm without knowledge of the bandit algorithm. Our adaptive attack strategy can hijack the behavior of the bandit algorithm, causing it to suffer linear regret with only a logarithmic cost to the attacker. Our results demonstrate a significant security threat to stochastic bandits.
1. Introduction
Understanding adversarial attacks on machine learning systems is essential to developing effective defense mechanisms and an important step toward trustworthy artificial intelligence. A class of adversarial attacks on machine learning that has received much attention is data poisoning (Biggio et al., 2012; Mei & Zhu, 2015; Xiao et al., 2015; Alfeld et al., 2016; Li et al., 2016; Wang & Chaudhuri, 2018). Here, the attacker is able to access the learner's training data, and has the power to manipulate a fraction of the training data in order to make the learner satisfy certain objectives. This
1Department of Electrical and Computer Engineering, 2Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA. Correspondence to: Fang Liu
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
is motivated by modern industrial-scale applications of machine learning systems, where data collection and policy updates are done in a distributed way. Attacks can happen when the learner is not aware of the attacker's access to the training data.
While there has been a line of research on adversarial attacks on deep learning (Goodfellow et al., 2015; Huang et al., 2017; Lin et al., 2017) and supervised learning (Biggio et al., 2012; Mei & Zhu, 2015; Xiao et al., 2015; Alfeld et al., 2016; Li et al., 2016), little is known about adversarial attacks on stochastic multi-armed bandits, a form of online learning with limited feedback. This is potentially hazardous since stochastic bandits are widely used in industry to recommend news articles (Li et al., 2010), display advertisements (Chapelle et al., 2015), and allocate medical treatment (Thompson, 1933), among many other applications. Hence, understanding the impact of adversarial attacks on bandit algorithms is an urgent yet open problem.
Recently, there has been important work on offline data poisoning attacks against contextual bandit algorithms (Ma et al., 2018). They assume that the bandit algorithm updates periodically and that the attacker can manipulate the rewards in the data before the updates in order to hijack the behavior of the bandit algorithm. Consider news recommendation as a running example for this offline attack model. A news website has K articles (i.e., arms or actions) and runs a bandit algorithm to learn a recommendation policy. Every time a user visits the website, the bandit algorithm displays an article based on historical data. Then the website receives a binary reward indicating whether the user clicked on the displayed article or not. The website keeps serving users throughout the day and updates the bandit algorithm during the night. An attacker can perform offline data poisoning attacks before the bandit algorithm is updated. More specifically, the attacker may poison the rewards collected during the daytime and control the behavior of the bandit algorithm as it wishes. The authors of (Ma et al., 2018) show that the offline attack strategy on the LinUCB-type contextual bandit algorithm (Li et al., 2010) can be formulated as a convex optimization problem. However, offline attack strategies for classical bandit algorithms remain open.
In contrast to ofﬂine attacks, an online form of data poisoning attacks against bandit algorithms has also been proposed
by (Jun et al., 2018). They assume that the bandit algorithm updates step by step and that the attacker sits in between the bandit environment and the bandit algorithm. At each time step, the bandit algorithm pulls an arm and the attacker eavesdrops on the decision. Then the attacker makes an attack by manipulating the reward generated from the bandit environment. The bandit algorithm receives the poisoned reward without knowing the presence of the attacker and updates accordingly. The goal of the attacker is to control which arm appears to the bandit algorithm as the best arm at the end. Efficient attack strategies against the ε-greedy and Upper Confidence Bound (UCB) algorithms are proposed by (Jun et al., 2018). However, online attack strategies for other bandit algorithms (e.g., Thompson Sampling (Thompson, 1933) and UCBoost (Liu et al., 2018)) are still unknown.
In this work, we conduct a systematic investigation of data poisoning attacks against bandit algorithms and address the aforementioned open problems. We study data poisoning attacks in both the offline and the online cases. In the offline setting, the bandit algorithm updates periodically and the attacker can manipulate the rewards in the collected data before the update occurs. In the online setting, the attacker eavesdrops on the bandit algorithm and manipulates the feedback reward. The goal of the attacker is that the bandit algorithm considers the target arm as the optimal arm at the end. Specifically, we make the following contributions to data poisoning attacks on stochastic bandits.
1. We propose an optimization-based framework for offline attacks on bandit algorithms. Then, we instantiate three offline attack strategies against the ε-greedy algorithm, the UCB algorithm, and Thompson Sampling, which are the solutions of the corresponding convex optimization problems. That is, there exist efficient attack strategies for the attacker against these popular bandit algorithms.
2. We study the online attacks on bandit algorithms and propose an adaptive attack strategy that can hijack any bandit algorithm without knowing the bandit algorithm. As far as we know, this is the ﬁrst negative result showing that there is no robust and good stochastic bandit algorithm that can survive online poisoning attacks.
3. We evaluate our attack strategies by numerical results. Our attack strategies are efﬁcient in forcing the bandit algorithms to pull a target arm at a relatively small cost. Our results expose a signiﬁcant security threat as bandit algorithms are widely employed in the real world applications.
More recently, there has been a line of work by (Lykouris et al., 2018; Gupta et al., 2019) studying an adversarial corruption model. While they assume that the attacker has to attack all the arms before the learner chooses an arm, we consider the
case where the attacker’s strategy is aware of and adaptive to the decision of the learner. The difference leads to opposite results. While they propose a robust bandit algorithm, our work shows that the existing algorithms are vulnerable to the adversarial attacks.
This paper is organized as the following. In Section 2, we introduce the problem setting and attack system models. Then, we propose the ofﬂine attack problem in Section 3. In Section 4, we study the online attack problem and propose an adaptive attack strategy. The numerical results to evaluate our attack strategies are provided in Section 5. Finally, we conclude the paper with discussion in Section 6.
2. Problem Formulation
We consider the classical stochastic bandit setting. Suppose that there is a set A = {1, 2, . . . , K} of K arms and the bandit algorithm proceeds in discrete time t = 1, 2, . . . , T . At each round t, the algorithm pulls an arm at ∈ A and the bandit environment generates a reward rt ∈ R such that
r_t = µ_{a_t} + η_t, (1)

where η_t is a zero-mean, σ-sub-Gaussian noise term and µ_{a_t} is the expected reward of the arm pulled at time t. In other words, the expected reward of pulling arm a is µ_a. Note that {µ_a}_{a∈A} are unknown to both the bandit algorithm and the attacker.
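Equation (1) corresponds to the following sampling sketch; Gaussian noise is one example of σ-sub-Gaussian noise, and the names are hypothetical:

```python
import numpy as np

def draw_reward(a, mu, sigma=0.1, rng=None):
    """Reward model of eq. (1): r_t = mu_{a_t} + eta_t, with eta_t
    zero-mean Gaussian noise (which is sigma-sub-Gaussian)."""
    rng = rng if rng is not None else np.random.default_rng()
    return mu[a] + sigma * rng.standard_normal()
```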
The performance of the bandit algorithm is measured by the regret, which is the expected difference between the total reward obtained by an oracle that always pulls the optimal arm (i.e., the arm with the largest expected reward max_{a∈A} µ_a) and the accumulated reward collected by the bandit algorithm up to the time horizon T. Formally, the regret R(T) is given by

R(T) = E[ T · max_{a∈A} µ_a − ∑_{t=1}^{T} r_t ]. (2)
In this work, we consider the uniformly good bandit algorithm that incurs sublinear regret, i.e., R(T ) = o(T ).
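For a realized arm sequence, the pseudo-regret version of (2) can be computed directly (a hypothetical helper for illustration):

```python
def empirical_pseudo_regret(mu, arms_pulled):
    """Pseudo-regret of eq. (2) for a realized arm sequence:
    T * max_a mu_a minus the sum of the means of the pulled arms."""
    best = max(mu)
    return sum(best - mu[a] for a in arms_pulled)
```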
For each reward r_t returned by the bandit environment, the attacker manipulates the reward into

r̃_t = r_t + ε_t. (3)

The accumulated attack cost of the attacker, C(T), is then measured by a norm of the vector (ε_1, . . . , ε_T)ᵀ. For simplicity, we consider the l2-norm in the offline setting and the l1-norm in the online setting. Note that the results in this work can easily be extended to any norm. For example, under the lp-norm, the total attack cost of the attacker is

C(T) = ( ∑_{t=1}^{T} |ε_t|^p )^{1/p}. (4)
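A sketch of the cost in (4), with a hypothetical helper name:

```python
def attack_cost(eps, p=1):
    """Total attack cost C(T) of eq. (4): the lp-norm of the
    perturbation vector (eps_1, ..., eps_T)."""
    return sum(abs(e) ** p for e in eps) ** (1.0 / p)
```

Setting p = 2 recovers the offline cost and p = 1 the online cost used in this paper.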
Data Poisoning Attacks on Stochastic Bandits
a t < l a t e x i t s h a 1 _ b a s e 6 4 = " n 4 f Y N Z o W f + O 7 S E 0 1 l O y 5 I L Q C Y / k = " > A A A B 6 X i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 l E U G 8 F P X i s a G y h D W W z 3 b R L N 5 u w O x F K 6 E / w 4 k H F q / / I m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 M J X C o O t + O 6 W V 1 b X 1 j f J m Z W t 7 Z 3 e v u n / w a J J M M + 6 z R C a 6 H V L D p V D c R 4 G S t 1 P N a R x K 3 g p H 1 1 O / 9 c S 1 E Y l 6 w H H K g 5 g O l I g E o 2 i l e 9 r D X r X m 1 t 0 Z y D L x C l K D A s 1 e 9 a v b T 1 g W c 4 V M U m M 6 n p t i k F O N g k k + q X Q z w 1 P K R n T A O 5 Y q G n M T 5 L N T J + T E K n 0 S J d q W Q j J T f 0 / k N D Z m H I e 2 M 6 Y 4 N I v e V P z P 6 2 Q Y X Q a 5 U G m G X L H 5 o i i T B B M y / Z v 0 h e Y M 5 d g S y r S w t x I 2 p J o y t O l U b A j e 4 s v L x D + r X 9 X d u / N a 4 6 Z I o w x H c A y n 4 M E F N O A W m u A D g w E 8 w y u 8 O d J 5 c d 6 d j 3 l r y S l m D u E P n M 8 f v B O N o Q = = < / l a t e x i t >
bandit algorithm
attacker
{~✏a < l a t e x i t s h a 1 _ b a s e 6 4 = " m p c O l G B I d M l i J g v Q 3 6 w b d h K 6 G A Q = " > A A A B / H i c b V B N S 8 N A E N 3 U r 1 q / 4 s f N S 7 A I n k o q g n o r e v F Y w W i h C W G z n b R L N 7 t h d 1 O o I f h X v H h Q 8 e o P 8 e a / c d v m o K 0 P B h 7 v z T A z L 0 o Z V d p 1 v 6 3 K 0 v L K 6 l p 1 v b a x u b W 9 Y + / u 3 S u R S Q I e E U z I T o Q V M M r B 0 1 Q z 6 K Q S c B I x e I i G 1 x P / Y Q R S U c H v 9 D i F I M F 9 T m N K s D Z S a B / 4 u T 8 C k v u Q K s o E L 0 L s F 6 F d d x v u F M 4 i a Z a k j k q 0 Q / v L 7 w m S J c A 1 Y V i p b t N N d Z B j q S l h U N T 8 T E G K y R D 3 o W s o x w m o I J 9 e X z j H R u k 5 s Z C m u H a m 6 u + J H C d K j Z P I d C Z Y D 9 S 8 N x H / 8 7 q Z j i + C n P I 0 0 8 D J b F G c M U c L Z x K F 0 6 M S i G Z j Q z C R 1 N z q k A G W m G g T W M 2 E 0 J x / e Z F 4 p 4 3 L h n t 7 V m 9 d l W l U 0 S E 6 Q i e o i c 5 R C 9 2 g N v I Q Q Y / o G b 2 i N + v J e r H e r Y 9 Z a 8 U q Z / b R H 1 i f P 8 H 1 l Z 4 = < / l a t e x i t >
}
data buffer of size T
bandit environment
( a t < l a t e x i t s h a 1 _ b a s e 6 4 = " r 1 y + / y t P O g S m 2 s V t U 7 + i 0 7 T 0 + N E = " > A A A B 7 3 i c b V B N S 8 N A E N 3 4 W e t X 1 a O X x S J U k J K I o N 4 K e v B Y w d h K G 8 J m u 2 m X 7 m 7 C 7 k Q o o b / C i w c V r / 4 d b / 4 b t 2 0 O 2 v p g 4 P H e D D P z o l R w A 6 7 7 7 S w t r 6 y u r Z c 2 y p t b 2 z u 7 l b 3 9 B 5 N k m j K f J i L R 7 Y g Y J r h i P n A Q r J 1 q R m Q k W C s a X k / 8 1 h P T h i f q H k Y p C y T p K x 5 z S s B K j z U S w q k O 4 S S s V N 2 6 O w V e J F 5 B q q h A M 6 x 8 d X s J z S R T Q A U x p u O 5 K Q Q 5 0 c C p Y O N y N z M s J X R I + q x j q S K S m S C f H j z G x 1 b p 4 T j R t h T g q f p 7 I i f S m J G M b K c k M D D z 3 k T 8 z + t k E F 8 G O V d p B k z R 2 a I 4 E x g S P P k e 9 7 h m F M T I E k I 1 t 7 d i O i C a U L A Z l W 0 I 3 v z L i 8 Q / q 1 / V 3 b v z a u O m S K O E D t E R q i E P X a A G u k V N 5 C O K J H p G r + j N 0 c 6 L 8 + 5 8 z F q X n G L m A P 2 B 8 / k D T J m P n w = = < / l a t e x i t >
,
rt)
a t < l a t e x i t s h a 1 _ b a s e 6 4 = " n 4 f Y N Z o W f + O 7 S E 0 1 l O y 5 I L Q C Y / k = " > A A A B 6 X i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 l E U G 8 F P X i s a G y h D W W z 3 b R L N 5 u w O x F K 6 E / w 4 k H F q / / I m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 M J X C o O t + O 6 W V 1 b X 1 j f J m Z W t 7 Z 3 e v u n / w a J J M M + 6 z R C a 6 H V L D p V D c R 4 G S t 1 P N a R x K 3 g p H 1 1 O / 9 c S 1 E Y l 6 w H H K g 5 g O l I g E o 2 i l e 9 r D X r X m 1 t 0 Z y D L x C l K D A s 1 e 9 a v b T 1 g W c 4 V M U m M 6 n p t i k F O N g k k + q X Q z w 1 P K R n T A O 5 Y q G n M T 5 L N T J + T E K n 0 S J d q W Q j J T f 0 / k N D Z m H I e 2 M 6 Y 4 N I v e V P z P 6 2 Q Y X Q a 5 U G m G X L H 5 o i i T B B M y / Z v 0 h e Y M 5 d g S y r S w t x I 2 p J o y t O l U b A j e 4 s v L x D + r X 9 X d u / N a 4 6 Z I o w x H c A y n 4 M E F N O A W m u A D g w E 8 w y u 8 O d J 5 c d 6 d j 3 l r y S l m D u E P n M 8 f v B O N o Q = = < / l a t e x i t >
bandit algorithm
attacker
✏< l a t e x i t s h a 1 _ b a s e 6 4 = " C 0 D S r 4 2 Q Q i r t G n N m o 4 v 9 x B W / b P Q = " > A A A B 8 H i c b V B N S 8 N A E N 3 U r 1 q / q h 6 9 L B b B U 0 l E U G 8 F P X i s Y G y x D W W z n b R L N 5 u w O x F K 6 L / w 4 k H F q z / H m / / G b Z u D t j 4 Y e L w 3 w 8 y 8 M J X C o O t + O 6 W V 1 b X 1 j f J m Z W t 7 Z 3 e v u n / w Y J J M c / B 5 I h P d D p k B K R T 4 K F B C O 9 X A 4 l B C K x x d T / 3 W E 2 g j E n W P 4 x S C m A 2 U i A R n a K X H L q R G y E T 1 s F e t u X V 3 B r p M v I L U S I F m r / r V 7 S c 8 i 0 E h l 8 y Y j u e m G O R M o + A S J p V u Z i B l f M Q G 0 L F U s R h M k M 8 u n t A T q / R p l G h b C u l M / T 2 R s 9 i Y c R z a z p j h 0 C x 6 U / E / r 5 N h d B n k Q q U Z g u L z R V E m K S Z 0 + j 7 t C w 0 c 5 d g S x r W w t 1 I + Z J p x t C F V b A j e 4 s v L x D + r X 9 X d u / N a 4 6 Z I o 0 y O y D E 5 J R 6 5 I A 1 y S 5 r E J 5 w o 8 k x e y Z t j n B f n 3 f m Y t 5 a c Y u a Q / I H z + Q N I k Z D c < / l a t e x i t > t
bandit environment
Figure 1: Offline attack system model
Figure 2: Online attack system model
Without loss of generality, we assume that arm a∗ is a suboptimal attack target, such that µa∗ < maxa∈A µa. The attacker’s goal is to manipulate the bandit algorithm into pulling arm a∗ frequently. To avoid being detected, the attacker also wants to keep the cost as small as possible.
2.1. Ofﬂine attack system model
The offline attack system model is illustrated in Figure 1. Besides the bandit algorithm, the bandit environment and the attacker, there is a data buffer of size T. This models the setting where updates of the bandit algorithm happen in mini-batches of size T. For round t = 1, ..., T, the bandit algorithm sequentially pulls arm a_t. Then the environment generates the reward r_t according to Equation (1) and sends the tuple (a_t, r_t) to the data buffer. The data buffer stores the data stream until round T. At the end of round T, the attacker accesses the data buffer and poisons the data by Equation (3). Finally, the data buffer sends the poisoned data stream {(a_t, r̃_t)}_{t≤T} to the bandit algorithm, and the bandit algorithm updates according to the received data without knowing the existence of the attacker.
The goal of the attacker in the offline setting is to force the bandit algorithm to pull the target arm a* with high probability at round T + 1 (i.e., after updating with the poisoned data) while incurring only a small cost. That is, the attacker wants to make the poisoning effort {ε_t}_{t≤T} as small as possible in order to stay stealthy.
2.2. Online attack system model
The online attack system model is illustrated in Figure 2. In the online setting, the bandit algorithm updates instantly at each time step. The attacker stealthily monitors the decision a_t of the bandit algorithm at each time t and poisons the reward signal returned from the bandit environment by Equation (3). Then the bandit algorithm receives the manipulated reward signal r̃_t and updates, unaware of the attacker.
The goal of the attacker in the online setting is to hijack the behavior of the bandit algorithm with high probability by manipulating the reward signal so that the bandit algorithm
pulls the target arm a* in Θ(T) time steps. In the meantime, the attacker wants to control its attack cost by poisoning as infrequently as possible in order to avoid being detected. Note that ε_t = 0 is considered as no attack.
Since η_t is zero-mean noise, the expected regret follows by linearity of expectation:

R(T) = Σ_{a∈A} (max_{i∈A} µ_i − µ_a) E[N_a(T)],  (5)

where N_a(t) is the number of pulls of arm a up to time t. Thus, the attack goal means that the attacker wants the bandit algorithm to incur a linear regret while the attacker incurs only a sublinear attack cost.
3. Ofﬂine Attacks
In this section, we introduce the offline attack framework for stochastic bandits. The updates of the bandit algorithm happen in mini-batches of size T, where the batch size T is a relatively large integer compared to K. Between two consecutive updates, the bandit algorithm follows a fixed policy obtained from the last update. This allows the attacker to poison the historical data before the update and force the algorithm to pull a target arm a* with high probability.

Let m_a be the number of times arm a was pulled up to time T, i.e., m_a := N_a(T). For each arm a ∈ A, let y_a ∈ R^{m_a} be the corresponding reward vector returned from the bandit environment when arm a was pulled, that is, y_a := (r_t : a_t = a)^T. Let ε_a ∈ R^{m_a} be the poisoning attack strategy of the attacker, i.e., ε_a := (ε_t : a_t = a)^T. The poisoned reward vector for arm a after the attack becomes y_a + ε_a. To avoid being detected, the attacker hopes to make the poisoning ε_a as small as possible. Without loss of generality, we consider the ℓ2-norm attack cost in the offline attacks. We have that

C(T)² = Σ_{t=1}^T ε_t² = Σ_{a∈A} ‖ε_a‖₂².  (6)

Thus, the attacker's offline attack problem can be formulated
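As a concrete check of Equation (6), the squared ℓ2 cost can be computed either per round or per arm; regrouping the per-round perturbations by pulled arm gives the same total. A minimal sketch (the helper name is ours, not from the paper's MATLAB code):

```python
def offline_cost_sq(eps):
    """Squared l2 attack cost C(T)^2 of Eq. (6): the per-round
    perturbations epsilon_t, regrouped by the arm that was pulled,
    sum to sum over arms of ||eps_a||_2^2.

    eps: dict mapping arm -> list of perturbations eps_a for that arm.
    """
    return sum(sum(e * e for e in eps_a) for eps_a in eps.values())
```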
as the following optimization problem P:

P:  min_{ε_a : a∈A}  Σ_{a∈A} ‖ε_a‖₂²  (7)
    s.t.  P{a_{T+1} = a*} ≥ 1 − δ,  (8)

for some error tolerance δ > 0. Note that we define the attack goal as forcing the bandit algorithm to pull arm a* at the next round with high probability. This is because some bandit algorithms, such as ε-greedy and Thompson Sampling, are randomized; it is not feasible to force a randomized algorithm to pull a target arm with probability 1, but it is possible to hijack its behavior with high probability.
Proposition 1. Given any error tolerance δ > 0, if {ε*_a}_{a∈A} is the optimal solution of problem P, then it is the optimal offline attack strategy for the attacker.
The proof of Proposition 1 follows from Equation (6) and the definition of offline attacks. Note that the constraint of problem P depends on the bandit algorithm. In what follows, we assume that the attacker knows the bandit algorithm, and we introduce algorithm-specific offline attack strategies derived from the optimization problem P for three popular bandit algorithms: ε-greedy, UCB and Thompson Sampling. For simplicity, we denote the post-attack empirical mean observed by the bandit algorithm at the end of round t as

µ̃_a(t) := (1/N_a(t)) Σ_{τ=1}^t r̃_τ 1{a_τ = a}.  (9)
3.1. Offline attack on ε-greedy

Now, we derive the offline attack strategy for the ε-greedy algorithm, a randomized algorithm with some decreasing exploration rate function α_t; typically, α_t = Θ(1/t). At each time t, the ε-greedy algorithm pulls an arm

a_t = { an arm drawn uniformly over A,   with probability α_t;
        arg max_{a∈A} µ̃_a(t − 1),       otherwise.       (10)
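For reference, the selection rule (10) can be sketched as follows; `epsilon_greedy_pull` is a hypothetical helper, with the default α_t = 1/t used in the experiments of Section 5:

```python
import random

def epsilon_greedy_pull(emp_means, t, alpha=None):
    """One step of the epsilon_t-greedy rule of Eq. (10): with
    probability alpha_t explore uniformly over the arms; otherwise
    exploit the arm with the largest (post-attack) empirical mean."""
    alpha_t = alpha if alpha is not None else 1.0 / t
    if random.random() < alpha_t:
        return random.randrange(len(emp_means))                    # explore
    return max(range(len(emp_means)), key=lambda a: emp_means[a])  # exploit
```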
At time step T + 1, the ε-greedy algorithm uniformly samples an arm from the arm set A with probability α_{T+1}, which cannot be controlled by the attacker. We therefore set the error tolerance to δ = ((K − 1)/K) α_{T+1}, since the target arm a* may also be pulled in the uniform sampling. Thus, the attacker poisons the rewards so that the target arm a* has the largest empirical mean. After the attack, the ε-greedy algorithm pulls an arm a_{T+1} at time T + 1 such that P{a_{T+1} = a*} = 1 − δ. In order to make the target arm the unique arm with the largest empirical mean, we introduce a margin parameter ξ > 0. So the attacker's optimization problem for attacking the ε-greedy algorithm is the following problem P1:

P1:  min_{ε_a : a∈A}  Σ_{a∈A} ‖ε_a‖₂²  (11)
     s.t.  µ̃_{a*}(T) ≥ µ̃_a(T) + ξ,  ∀a ≠ a*,  (12)

where µ̃_a(T) = (y_a + ε_a)^T 1/m_a and 1 is the all-ones vector. The constraint in the above optimization implies that the ε-greedy algorithm will play the target arm a* at time step T + 1 with probability at least 1 − ((K − 1)/K) α_{T+1}. The following result shows that there exists at least one optimal solution of problem P1, i.e., one optimal offline attack for the ε-greedy algorithm.
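Solving P1 exactly requires a quadratic-program solver; a simple feasible (generally suboptimal) point lowers every non-target arm's empirical mean until the target leads by margin ξ. A sketch under that assumption (the helper name is ours):

```python
def feasible_offline_attack(rewards, target, xi=0.001):
    """A feasible (not necessarily optimal) solution to problem P1:
    shift every reward of each non-target arm down by a constant so
    that the target arm's empirical mean is largest by margin xi.

    rewards: dict arm -> list of pre-attack rewards y_a.
    Returns: dict arm -> list of perturbations eps_a.
    """
    means = {a: sum(r) / len(r) for a, r in rewards.items()}
    eps = {}
    for a, r in rewards.items():
        if a == target:
            eps[a] = [0.0] * len(r)
        else:
            shift = max(0.0, means[a] - means[target] + xi)
            eps[a] = [-shift] * len(r)  # lowers arm a's mean by `shift`
    return eps
```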
Theorem 1. Given any margin parameter ξ > 0 and any reward instance {y_a}_{a∈A}, there exists at least one optimal solution of problem P1, which is a quadratic program with linear constraints. Hence, there exists at least one optimal offline attack for the ε-greedy algorithm.
The proof of Theorem 1 is provided in Section A.1 in the supplementary material. Note that the error tolerance parameter δ = ((K − 1)/K) α_{T+1} depends on the rate function α_t of the ε-greedy algorithm, which is not controllable by the attacker. It accounts for the exploration introduced by the bandit algorithm's inner randomization, which cannot be manipulated by the attacker. However, the attacker can wait for some large enough T, since the rate function α_t eventually goes to zero. Moreover, the attacker's strategy (problem P1) does not depend on the rate function, i.e., the attacker does not require knowledge of the parameters of the ε-greedy algorithm.
3.2. Ofﬂine attack on UCB
Now we derive the offline attack strategy for the classical UCB algorithm. At each time t, the UCB algorithm pulls the arm with the highest UCB index:

a_t = arg max_{a∈A} u_a(t) := µ̃_a(t − 1) + 3σ √(log t / N_a(t − 1)).  (13)
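The index in Equation (13) is cheap to compute; a hypothetical one-function sketch:

```python
import math

def ucb_index(emp_mean, n_pulls, t, sigma=0.1):
    """UCB index u_a(t) of Eq. (13): post-attack empirical mean plus
    a 3*sigma*sqrt(log t / N_a(t-1)) exploration bonus."""
    return emp_mean + 3 * sigma * math.sqrt(math.log(t) / n_pulls)
```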
The UCB algorithm pulls the target arm a* at time T + 1 if and only if the UCB index of arm a* is the unique largest one. Thus, the attacker can manipulate the rewards to satisfy this condition. Given any margin parameter ξ > 0, the attacker's optimization problem becomes

P2:  min_{ε_a : a∈A}  Σ_{a∈A} ‖ε_a‖₂²  (14)
     s.t.  u_{a*}(T + 1) ≥ u_a(T + 1) + ξ,  ∀a ≠ a*.
The condition in the above optimization implies that the UCB algorithm will pull the target arm after the poisoning attack. The following result shows that there exists at least one optimal solution of problem P2, i.e., one optimal ofﬂine attack for the UCB algorithm.
Theorem 2. Given any margin parameter ξ > 0 and any reward instance {y_a}_{a∈A}, there exists at least one optimal solution of problem P2, which is a quadratic program with linear constraints. Hence, there exists at least one optimal offline attack for the UCB algorithm.
The proof of Theorem 2 is similar to the proof of Theorem 1. Note that the above attack strategy holds for any error tolerance δ, since the UCB algorithm is deterministic.
3.3. Empirical mean based bandit algorithms
One insight from our offline attacks on the ε-greedy algorithm and the UCB algorithm is that empirical-mean-based algorithms are vulnerable to attack. This is because the empirical-mean-related constraints are linear and the feasible set is non-empty. Then the optimization problem P becomes a quadratic program with non-empty linear constraints, which can be solved efficiently. This result holds for many other variants of the UCB algorithm and for Explore-Then-Commit algorithms (Garivier et al., 2016), which uniformly sample the arm set in the first (exploration) phase and commit to the arm with the largest empirical mean in the second (commitment) phase.
3.4. Beyond empirical means: Thompson Sampling for Gaussian distributions
We have shown that empirical-mean-based algorithms are not secure against the offline attack. It would appear that Bayesian algorithms should be more robust to the offline attack, as the constraint of problem P becomes intractable. Unfortunately, we find that Thompson Sampling, a popular Bayesian algorithm, is also vulnerable to data poisoning.
Now, we derive the attack strategy for Thompson Sampling for Gaussian distributions. (Thompson Sampling needs distribution models; the Gaussian distribution is popular and well studied in the literature, and the idea of problem relaxation here can be extended to other distributions.) In other words, the noise η_t is i.i.d. sampled from the Gaussian distribution N(0, σ²). Thompson Sampling for Gaussian distributions with the Jeffreys prior (Korda et al., 2013) works as follows. At each time t, the algorithm samples θ_a(t) from the posterior distribution N(µ̃_a(t − 1)/σ², σ²/N_a(t − 1)) for each arm a independently. Then the algorithm pulls the arm with the highest sample value, i.e., a_t = arg max_{a∈A} θ_a(t).

Let φ(x) = (1/√(2π)) e^{−x²/2} be the probability density function (pdf) of the standard Gaussian distribution and Φ(x) be the corresponding cumulative distribution function (cdf). For simplicity, we denote by φ_a(x) the pdf of the Gaussian distribution N(µ̃_a(T)/σ², σ²/N_a(T)) for each arm a and by Φ_a(x) the corresponding cdf. Since we are interested in the samples at time T + 1, we omit the time index in the following discussion. By the law of total expectation, we have that
P{a_{T+1} = a*} = P{∩_{a≠a*} {θ_{a*} > θ_a}}  (15)
  = ∫_{−∞}^{+∞} P{∩_{a≠a*} {θ_a < x} | θ_{a*} = x} φ_{a*}(x) dx
  = ∫_{−∞}^{+∞} (Π_{a≠a*} P{θ_a < x | θ_{a*} = x}) φ_{a*}(x) dx
  = ∫_{−∞}^{+∞} (Π_{a≠a*} Φ_a(x)) φ_{a*}(x) dx.  (16)
Equation (16) is complicated to compute and analyze, which makes the instantiation of the offline attack problem P non-trivial. To address this challenge, we derive a sufficient constraint for the original constraint and obtain a relaxation of the original problem P. By the union bound, we have that

P{a_{T+1} ≠ a*} = P{∪_{a≠a*} {θ_{a*} < θ_a}} ≤ Σ_{a≠a*} P{θ_{a*} < θ_a}
  = Σ_{a≠a*} Φ( (µ̃_a(T) − µ̃_{a*}(T)) / (σ³ √(1/m_a + 1/m_{a*})) ).  (17)
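The right-hand side of (17) is cheap to evaluate with the standard Gaussian cdf Φ(x) = (1 + erf(x/√2))/2; a sketch (function names are ours):

```python
import math

def gaussian_cdf(x):
    """Standard Gaussian cdf Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ts_failure_bound(post_means, pulls, target, sigma=0.1):
    """Union bound of Eq. (17) on P{a_{T+1} != a*} for Thompson
    Sampling with Gaussian posteriors.

    post_means: post-attack empirical means mu_tilde_a(T) per arm.
    pulls: pull counts m_a per arm.
    """
    bound = 0.0
    for a in range(len(post_means)):
        if a == target:
            continue
        gap = post_means[a] - post_means[target]
        scale = sigma ** 3 * math.sqrt(1.0 / pulls[a] + 1.0 / pulls[target])
        bound += gaussian_cdf(gap / scale)
    return bound
```

With all empirical means equal, each of the K − 1 terms is Φ(0) = 0.5, matching the remark below problem P3 that constraints (20) become redundant once δ < (K − 1)/2.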
Thus, we are ready to introduce the attacker's problem for Thompson Sampling for Gaussian distributions:

P3:  min_{ε_a : a∈A}  Σ_{a∈A} ‖ε_a‖₂²  (18)
     s.t.  Σ_{a≠a*} Φ( (µ̃_a(T) − µ̃_{a*}(T)) / (σ³ √(1/m_a + 1/m_{a*})) ) ≤ δ,  (19)
           µ̃_a(T) − µ̃_{a*}(T) ≤ 0,  ∀a ≠ a*.  (20)

By Equation (17), the constraint of problem P3 is a sufficient condition for the constraint of problem P. Note that the linear constraints (20) are redundant, since Φ(0) = 0.5 and δ < (K − 1)/2 is usually satisfied. The following proposition shows that the constraint set of problem P3 is convex.
Proposition 2. The constraint set formed by Equations (19) and (20) is convex.
The proof of Proposition 2 is provided in Section A.3 in the supplementary material. The following result shows that there exists at least one optimal solution of problem P3 that is a feasible ofﬂine attack for the Thompson Sampling for Gaussian distributions.
Theorem 3. Given any error tolerance δ > 0 and any reward instance {y_a}_{a∈A}, there exists at least one optimal solution of problem P3, which is a quadratic program with convex constraints. Hence, there exists at least one feasible offline attack for the Thompson Sampling algorithm for Gaussian distributions.
By Proposition 2, the proof of Theorem 3 is similar to the
proof of Theorem 1. Note that the above attack strategy is not the optimal attack strategy formulated by P . However, it is easy to compute since problem P3 is a quadratic program with convex constraints. Another relaxation of problem P is presented in Section A.4 in the supplementary material.
4. Online Attacks
In this section, we study online attacks against stochastic
bandits. In the online attack setting, the bandit algorithm
updates its policy at each round. The attacker eavesdrops on the decision a_t of the bandit algorithm and poisons the reward by adding an arbitrary ε_t ∈ R. Hence the reward observed by the bandit algorithm is r̃_t = r_t + ε_t. Without loss of generality, we consider the ℓ1-norm attack cost; that is, the cost of the attacker at round t is |ε_t|. Recall that N_a(t) is the number of pulls of arm a up to time t.
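The online interaction of Figure 2 can be sketched as one round of a loop; the callback names below are our own abstractions, not the paper's:

```python
def online_round(learner_pull, env_reward, attacker_eps, learner_update, t):
    """One round of the online attack loop: the learner pulls an arm,
    the attacker observes (a_t, r_t) and adds a perturbation eps_t,
    and the learner updates on the poisoned reward r_t + eps_t.
    Returns the l1 cost |eps_t| of this round."""
    arm = learner_pull(t)
    r = env_reward(arm)
    eps = attacker_eps(arm, r, t)
    learner_update(arm, r + eps)
    return abs(eps)
```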
Without the attacks, the bandit algorithm is a uniformly good policy that achieves O(log T) regret³, i.e., E[Σ_a (max_i µ_i − µ_a) N_a(T)] = O(log T), for any problem instance {µ_a}_{a∈A}. Moreover, the expected number of pulls of the optimal arm (the arm with the highest expected reward) is T − o(T).
The goal of the attacker is to force the bandit algorithm to pull the suboptimal target arm a* as often as possible while paying the least possible total cost. Formally, the attacker wants the bandit algorithm to incur a linear expected regret, i.e., E[N_{a*}(T)] = T − o(T), while paying an expected total cost E[Σ_{t=1}^T |ε_t|] = O(log T). In other words, the attacker wants to manipulate the rewards so that the bandit algorithm considers the target arm to be the best arm in the long term.
4.1. Oracle constant attacks
The fact that the attacker does not know the expected rewards {µ_a}_{a∈A} makes the problem challenging, because otherwise the attacker could perform the attack trivially. Suppose the attacker knows the expected rewards; then the attacker can choose the following oracle attack strategy,

ε_t = −1{a_t ≠ a*} [µ_{a_t} − µ_{a*} + ξ]_+,  (21)

for some margin parameter ξ > 0. Note that [x]_+ = max{x, 0} and 1{·} is the indicator function. Under this attack, the bandit algorithm sees a poisoned bandit problem in which the target arm a* is the optimal arm and all the other arms are at least ξ below the target arm. Thus, the bandit algorithm pulls the target arm with E[N_{a*}(T)] = T − o(T) and pays a total cost E[Σ_{t=1}^T |ε_t|] = O(log T), since the ε_t are bounded. This result has been shown in (Jun et al., 2018) as the following.
³The results in this section directly apply to problem-independent regret bounds and high-probability bounds.
Proposition 3. (Proposition 1 in (Jun et al., 2018)) Assume that the bandit algorithm achieves an O(log T) regret bound. Then the oracle attack with ξ > 0 succeeds, i.e., E[N_{a*}(T)] = T − o(T). Furthermore, the expected attack cost is O(Σ_{i≠a*} [µ_i − µ_{a*} + ξ]_+ log T).
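The oracle attack (21) is a two-line rule once the true means are known; a sketch with a hypothetical helper name:

```python
def oracle_attack(arm, target, mu, xi=0.1):
    """Oracle attack of Eq. (21): never attack the target arm;
    otherwise drag the reward down so the target arm beats the
    pulled arm by margin xi. Requires the true means mu, which
    a real attacker does not know."""
    if arm == target:
        return 0.0
    return -max(mu[arm] - mu[target] + xi, 0.0)
```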
The insight obtained from Proposition 3 is that the attacker does not need to attack in round t if a_t = a*. This helps us design attack strategies that are similar to the oracle attack. When the ground truth {µ_a}_{a∈A} is not known to the attacker, the attacker may guess some upper bounds {C_a}_{a≠a*} on {[µ_a − µ_{a*}]_+}_{a≠a*} and perform the following oracle constant attack,

ε_t = −1{a_t ≠ a*} C_{a_t}.  (22)
The following result gives sufficient and necessary conditions for the oracle constant attack to be successful.

Proposition 4. Assume that the bandit algorithm achieves an O(log T) regret bound. Then the constant attack with {C_a}_{a≠a*} succeeds if and only if C_a > [µ_a − µ_{a*}]_+ for all a ≠ a*. If the attack succeeds, then the expected attack cost is O(Σ_{a≠a*} C_a log T).
The proof of Proposition 4 is provided in Section B.1 in the supplementary material. Proposition 4 shows that the attacker has to know the unknown bounds on {[µ_a − µ_{a*}]_+}_{a≠a*} to guarantee a successful constant attack. Moreover, the oracle constant attack is non-adaptive to the problem instance: some upper bound C_a can be much larger than the quantity [µ_a − µ_{a*}]_+, so the attacker pays an unnecessary attack cost compared to the oracle attack. To address this challenge, we propose an adaptive constant attack that learns the bandit environment in an online fashion and performs the attack adaptively.
4.2. Adaptive attack by constant estimation
Now, we are ready to propose the adaptive attack strategy.
The idea is that the attacker can maintain upper bounds on the unknown quantities {[µ_a − µ_{a*}]_+}_{a≠a*} based on the empirical means observed by the attacker, and then perform the constant attack using the estimated upper bounds. Note that the attacker observes the tuple (a_t, r_t) at each time t and is thus able to obtain an unbiased empirical mean. Let µ̂_a(t) be the pre-attack empirical mean observed by the attacker at time t, that is,

µ̂_a(t) := (1/N_a(t)) Σ_{τ=1}^t r_τ 1{a_τ = a}.  (23)
Given any δ ∈ (0, 1), we define a function β(n) as

β(n) = √( (2σ²/n) log( π² K n² / (3δ) ) ).  (24)
Then, the following result shows that {β(N_a(t))}_{a∈A} gives a high-probability bound on the pre-attack empirical means of all arms in all rounds. Define the event

E = {∀a ∈ A, ∀t : |µ̂_a(t) − µ_a| < β(N_a(t))}.  (25)
Lemma 1. (Lemma 1 in (Jun et al., 2018)) For δ ∈ (0, 1), P(E) > 1 − δ.
The proof of Lemma 1 follows from Hoeffding's inequality and the union bound. Lemma 1 shows that, with probability at least 1 − δ, the expected reward µ_a lies in the interval (µ̂_a(t) − β(N_a(t)), µ̂_a(t) + β(N_a(t))) centered at the pre-attack empirical mean µ̂_a(t). Thus, we obtain an upper bound on the quantity µ_a − µ_{a*} and propose the following Adaptive attack by Constant Estimation (ACE):

ε_t = −1{a_t ≠ a*} [µ̂_{a_t}(t) − µ̂_{a*}(t) + β(N_{a_t}(t)) + β(N_{a*}(t))]_+.
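Putting Equation (24) and the ACE rule together, the attacker's per-round computation might look like this (a sketch; the function names and defaults are ours):

```python
import math

def beta(n, sigma=0.1, K=2, delta=0.05):
    """Confidence width of Eq. (24)."""
    return math.sqrt((2 * sigma ** 2 / n)
                     * math.log(math.pi ** 2 * K * n ** 2 / (3 * delta)))

def ace_perturbation(arm, target, emp_means, pulls, sigma=0.1, K=2, delta=0.05):
    """ACE perturbation eps_t for the current round: zero on the
    target arm; otherwise minus the clipped estimated gap plus both
    confidence widths.

    emp_means, pulls: the attacker's pre-attack statistics
    mu_hat_a(t) and N_a(t)."""
    if arm == target:
        return 0.0
    gap = (emp_means[arm] - emp_means[target]
           + beta(pulls[arm], sigma, K, delta)
           + beta(pulls[target], sigma, K, delta))
    return -max(gap, 0.0)
```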
The following result shows that the ACE attack succeeds against any bandit algorithm, without knowledge of the algorithm.
Theorem 4. Given any δ ∈ (0, 0.5), assume that the bandit algorithm achieves an O(log T) regret bound with probability at least 1 − δ. Then, with probability at least 1 − 2δ, the ACE attacker forces the bandit algorithm to play the target arm a* N_{a*}(T) = T − o(T) times, using an instantaneous attack cost

|ε_t| ≤ 1{a_t ≠ a*} ( [µ_{a_t} − µ_{a*}]_+ + 2β(N_{a_t}(t)) + 2β(N_{a*}(t)) ).
In particular, with probability at least 1 − 2δ, the ACE attack is successful and the accumulated attack cost is

Σ_{t=1}^T |ε_t| ≤ O( Σ_{a≠a*} ( [µ_a − µ_{a*}]_+ + 4β(1) ) log T ).

The proof of Theorem 4 is provided in Section B.2 in the supplementary material. Theorem 4 shows that the ACE attack is universally successful against any bandit algorithm, without any prior information on {µ_a}_{a∈A}. Besides, the ACE attack incurs a high-probability accumulated attack cost as small as that of the oracle attack (high-probability bounds can be adapted from Proposition 3), with only an additional bounded additive term O(4β(1)K log T). That is, the ACE attack is close to the hindsight-oracle attack strategy. Moreover, the ACE attack requires no knowledge of the bandit algorithm. This is an advantage over the algorithm-dependent online attack strategies designed in (Jun et al., 2018), since the attacker may not know which bandit algorithm the learner uses in practice. As far as we know, this is the first negative result showing that no bandit algorithm can be immune to the adaptive online attack. This exposes a significant security threat to bandit learning systems.

5. Numerical Results

In this section, we run simulations on attacking the ε-greedy, UCB and Thompson Sampling algorithms to illustrate our theoretical results. All the simulations are run in MATLAB and the code can be found in the supplementary materials. Note that we implement ε-greedy with α_t = 1/t.

5.1. Offline attacks

To study the effectiveness of the offline attacks, we consider the following experiment. The bandit has K = 5 arms and the reward noise follows a Gaussian distribution N(0, σ²) with σ = 0.1. The attacker's target arm is arm K with µ_K = 0, while the expected rewards of the other arms are uniformly distributed in the unit interval [0, 1]. We set T = 1000 and the error tolerance to δ = 0.05.

In each attack trial, we first generate the bandit instance by drawing the expected rewards from the uniform distribution on [0, 1]. Then we run the bandit algorithm for T rounds and collect all the historical data. Without any attack, the bandit algorithm would have converged to some optimal arm in the trial, which is not the target arm, as the target arm has the lowest payoff. Then we set the margin parameter to ξ = 0.001 and run the corresponding offline attacks. The attack strategy is the solution of the optimization problem P_i. Since all the problems are quadratic programs with linear (convex) constraints, they can be solved by standard optimization tools.

We run 1000 attack trials. Note that the attack cost depends on the bandit instance. To evaluate the attack cost, we use the poisoning effort ratio (Ma et al., 2018):

‖ε‖₂² / ‖y‖₂² = Σ_{a∈A} ‖ε_a‖₂² / Σ_{a∈A} ‖y_a‖₂².  (26)
The poisoning effort ratio measures the ratio of the total cost over the norm of the original rewards. To evaluate the attack effectiveness, we check whether the poisoned data satisfy the constraint of the optimization problem P . If so, the bandit algorithm will play the target arm with probability at least 1 − δ.
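Equation (26) can be computed directly from the per-arm reward and perturbation vectors; a hypothetical helper:

```python
def poisoning_effort_ratio(eps, rewards):
    """Poisoning effort ratio of Eq. (26) (Ma et al., 2018): total
    squared perturbation over total squared norm of the original rewards.

    eps, rewards: dicts mapping arm -> list of per-pull values.
    """
    num = sum(sum(e * e for e in eps[a]) for a in eps)
    den = sum(sum(r * r for r in rewards[a]) for a in rewards)
    return num / den
```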
Figure 3 shows the results of the attacks against ε-greedy, UCB and Thompson Sampling. First, the attack strategies are effective, as all the attacks are successful. Second, the attack costs are small, as shown in the histograms: the poisoning effort ratios for attacking ε-greedy, UCB and Thompson Sampling are below 10%, 2% and 5%, respectively.
[Figure 3: Histograms of the poisoning effort ratio in the offline attacks. Panels: (a) attack on ε-greedy, (b) attack on UCB, (c) attack on Thompson Sampling. Success rate = 100% in all three cases.]
[Figure 4: Attack cost and target arm pulls in the online attacks, for ACE and the attacks of (Jun et al., 2018) with ∆ ∈ {0.2, 0.5, 1}. Panels: (a) attack on ε-greedy, (b) attack on UCB, (c) attack on Thompson Sampling.]
5.2. Online attacks
We compare our adaptive attack strategy with the two attack strategies proposed by (Jun et al., 2018). Note that those two attacks are designed against the ε-greedy and UCB algorithms and require knowledge of the algorithm. In contrast, our attack strategy ACE is a universal attack strategy against any bandit algorithm.
We consider the following experiment. The bandit has two arms. The expected rewards of arms 1 and 2 are µ_1 = ∆ and µ_2 = 0 with ∆ > 0. The attacker's target is arm 2. The reward noise is i.i.d. sampled from the Gaussian distribution N(0, σ²) with σ = 0.1. We set the error tolerance to δ = 0.05 and the time horizon to T = 10⁵ rounds. For the implementations of the attack strategies proposed by (Jun et al., 2018), we choose the tuning parameter ∆₀ = σ, which is suggested by (Jun et al., 2018) when T is not known to the attacker. We run 100 attack trials with different ∆ values.
Figure 4 shows the average attack cost and the number of target arm pulls in the online attacks. Note that the target arm pull curves shown are for the case ∆ = 1. First, we compare the number of target arm pulls with and without the ACE attack: the ACE attack dramatically forces the bandit algorithm to pull the target arm linearly in time. Second, the attack costs are relatively small compared to the loss of the bandit algorithm, which is linear in time. Generally, the attack costs of the ACE attack are bounded by O(log T) and increase as the reward gap ∆ becomes larger. These observations verify the result of Theorem 4. On the other hand, the attack costs on Thompson Sampling and ε-greedy are relatively smaller than that on UCB. This is because Thompson Sampling and ε-greedy converge to the "optimal" arm very fast, while UCB's exploration of "non-optimal" arms may still increase over time. Finally, compared
to the algorithm-specific attacks proposed by (Jun et al., 2018), the attack cost of ACE on ε-greedy is slightly worse, while the attack cost of ACE on UCB is much larger than that of Jun et al.'s attack. In the case of attacking the UCB algorithm, our universal attack strategy incurs more cost than the algorithm-specific attack, but it does not need to know the algorithm.
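The paper's experiments are in MATLAB; purely as an illustration, the online experiment can be reproduced in miniature with a self-contained Python sketch pitting the ACE attacker against a UCB learner (all names, defaults, and the small horizon are our assumptions):

```python
import math
import random

def simulate_ace_vs_ucb(mu, target, T=2000, sigma=0.1, delta=0.05, seed=0):
    """UCB learner (Eq. (13)) under the ACE attack; returns the
    fraction of target-arm pulls and the total l1 attack cost."""
    rng = random.Random(seed)
    K = len(mu)

    def beta(n):  # confidence width of Eq. (24)
        return math.sqrt((2 * sigma ** 2 / n)
                         * math.log(math.pi ** 2 * K * n ** 2 / (3 * delta)))

    pulls = [0] * K            # N_a(t), visible to learner and attacker
    poisoned_sum = [0.0] * K   # the learner only sees poisoned rewards
    clean_sum = [0.0] * K      # the attacker keeps pre-attack statistics
    cost = 0.0
    for t in range(1, T + 1):
        if t <= K:             # initialization: pull each arm once
            a = t - 1
        else:                  # UCB index of Eq. (13) on poisoned data
            a = max(range(K), key=lambda i: poisoned_sum[i] / pulls[i]
                    + 3 * sigma * math.sqrt(math.log(t) / pulls[i]))
        r = mu[a] + rng.gauss(0.0, sigma)
        eps = 0.0
        if a != target and pulls[a] > 0 and pulls[target] > 0:
            gap = (clean_sum[a] / pulls[a] - clean_sum[target] / pulls[target]
                   + beta(pulls[a]) + beta(pulls[target]))
            eps = -max(gap, 0.0)          # ACE perturbation
        pulls[a] += 1
        clean_sum[a] += r
        poisoned_sum[a] += r + eps
        cost += abs(eps)
    return pulls[target] / T, cost
```

On a two-arm instance with ∆ = 0.5 the target-arm fraction comes out close to 1 while the total cost stays small, matching the qualitative behavior reported for Figure 4.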
6. Conclusion
In this work, we study the open problem of data poisoning attacks on bandit algorithms. We propose an offline attack framework for stochastic bandits, with three algorithm-specific offline attack strategies against ε-greedy, UCB and Thompson Sampling. Then, we study online attacks on bandit algorithms and propose an adaptive attack strategy that can hijack the behavior of any bandit algorithm without requiring knowledge of the algorithm. Our theoretical results and simulations show that bandit algorithms are vulnerable to poisoning attacks in both the online and offline settings.
Acknowledgements
This work has been supported in part by a grant from DTRA (HDTRA1-14-1-0058), a grant from the Office of Naval Research (N00014-17-1-2417), grants from the National Science Foundation (CNS-1446582, CNS-1518829, CNS-1409336), and a grant from the Army Research Office (W911NF-15-1-0277). This work has also been supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2017-0-00692, Transport-aware Streaming Technique Enabling Ultra Low-Latency AR/VR Services).
Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. arXiv preprint arXiv:1902.08647, 2019.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
Jun, K.-S., Li, L., Ma, Y., and Zhu, X. Adversarial attacks on stochastic bandits. arXiv preprint arXiv:1810.12188, 2018.
Korda, N., Kaufmann, E., and Munos, R. Thompson sampling for 1dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pp. 1448–1456, 2013.
Li, B., Wang, Y., Singh, A., and Vorobeychik, Y. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pp. 1885–1893, 2016.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.

Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-Y., and Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
References
Alfeld, S., Zhu, X., and Barford, P. Data poisoning attacks against autoregressive models. In AAAI, pp. 1452–1458, 2016.
Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, pp. 1467–1474, 2012.
Chapelle, O., Manavoglu, E., and Rosales, R. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2015.
Garivier, A., Lattimore, T., and Kaufmann, E. On explorethencommit strategies. In Advances in Neural Information Processing Systems, pp. 784–792, 2016.
Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
Liu, F., Wang, S., Buccapatnam, S., and Shroff, N. UCBoost: a boosting approach to tame complexity and optimality for stochastic bandits. arXiv preprint arXiv:1804.05929, 2018.
Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114–122. ACM, 2018.
Ma, Y., Jun, K.-S., Li, L., and Zhu, X. Data poisoning attacks in contextual bandits. arXiv preprint arXiv:1808.05760, 2018.
Mei, S. and Zhu, X. Using machine teaching to identify optimal trainingset attacks on machine learners. In AAAI, pp. 2871–2877, 2015.
Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Wang, Y. and Chaudhuri, K. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
Xiao, H., Biggio, B., Brown, G., Fumera, G., Eckert, C., and Roli, F. Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pp. 1689–1698, 2015.