Data Poisoning Attacks on Stochastic Bandits



arXiv:1905.06494v1 [cs.LG] 16 May 2019

Fang Liu 1 Ness Shroff 1 2

Abstract
Stochastic multi-armed bandits form a class of online learning problems that have important applications in online recommendation systems, adaptive medical treatment, and many others. Even though potential attacks against these learning algorithms may hijack their behavior, causing catastrophic loss in real-world applications, little is known about adversarial attacks on bandit algorithms. In this paper, we propose a framework of offline attacks on bandit algorithms and study convex-optimization-based attacks on several popular bandit algorithms. We show that the attacker can force the bandit algorithm to pull a target arm with high probability by a slight manipulation of the rewards in the data. Then we study a form of online attacks on bandit algorithms and propose an adaptive attack strategy against any bandit algorithm, without the knowledge of the bandit algorithm. Our adaptive attack strategy can hijack the behavior of the bandit algorithm, forcing it to suffer linear regret at only a logarithmic cost to the attacker. Our results demonstrate a significant security threat to stochastic bandits.
1. Introduction
Understanding adversarial attacks on machine learning systems is essential to developing effective defense mechanisms and an important step toward trustworthy artificial intelligence. A class of adversarial attacks on machine learning that has received much attention is data poisoning (Biggio et al., 2012; Mei & Zhu, 2015; Xiao et al., 2015; Alfeld et al., 2016; Li et al., 2016; Wang & Chaudhuri, 2018). Here, the attacker is able to access the learner's training data, and has the power to manipulate a fraction of the training data in order to make the learner satisfy certain objectives. This
1Department of Electrical and Computer Engineering, 2Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA. Correspondence to: Fang Liu, Ness Shroff.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

is motivated by modern industrial-scale applications of machine learning systems, where data collection and policy updates are done in a distributed way. Attacks can happen when the learner is not aware of the attacker's access to the training data.
While there has been a line of research on adversarial attacks on deep learning (Goodfellow et al., 2015; Huang et al., 2017; Lin et al., 2017) and supervised learning (Biggio et al., 2012; Mei & Zhu, 2015; Xiao et al., 2015; Alfeld et al., 2016; Li et al., 2016), little is known about adversarial attacks on stochastic multi-armed bandits, a form of online learning with limited feedback. This is potentially hazardous since stochastic multi-armed bandits are widely used in industry to recommend news articles (Li et al., 2010), display advertisements (Chapelle et al., 2015), and allocate medical treatment (Thompson, 1933), among many other applications. Hence, understanding the impact of adversarial attacks on bandit algorithms is an urgent yet open problem.
Recently, an important first step on offline data poisoning attacks was made for contextual bandit algorithms (Ma et al., 2018). They assume that the bandit algorithm updates periodically and that the attacker can manipulate the rewards in the data before the updates in order to hijack the behavior of the bandit algorithm. Consider news recommendation as a running example for this offline attack model. A news website has K articles (i.e., arms or actions) and runs a bandit algorithm to learn a recommendation policy. Every time a user visits the website, the bandit algorithm displays an article based on historical data. Then the website receives a binary reward indicating whether the user clicks on the displayed article or not. The website keeps serving users throughout the day and updates the bandit algorithm during the night. An attacker can perform offline data poisoning attacks before the bandit algorithm is updated. More specifically, the attacker may poison the rewards collected during the daytime and thereby control the behavior of the bandit algorithm as it wishes. The authors in (Ma et al., 2018) show that the offline attack strategy on the LinUCB-type contextual bandit algorithm (Li et al., 2010) can be formulated as a convex optimization problem. However, offline attack strategies for classical bandit algorithms remain open.
In contrast to offline attacks, an online form of data poisoning attacks against bandit algorithms has also been proposed


by (Jun et al., 2018). They assume that the bandit algorithm updates step by step and the attacker sits in between the bandit environment and the bandit algorithm. At each time step, the bandit algorithm pulls an arm and the attacker eavesdrops on the decision. Then the attacker makes an attack by manipulating the reward generated from the bandit environment. The bandit algorithm receives the poisoned reward without knowing the presence of the attacker and updates accordingly. The goal of the attacker is to control which arm appears to the bandit algorithm as the best arm at the end. Efficient attack strategies against the ε-greedy and Upper Confidence Bound (UCB) algorithms are proposed by (Jun et al., 2018). However, online attack strategies for other bandit algorithms (e.g., Thompson Sampling (Thompson, 1933) and UCBoost (Liu et al., 2018)) are still unknown.
In this work, we conduct a systematic investigation of data poisoning attacks against bandit algorithms and address the aforementioned open problems. We study data poisoning attacks in both the offline and the online cases. In the offline setting, the bandit algorithm updates periodically and the attacker can manipulate the rewards in the collected data before the update occurs. In the online setting, the attacker eavesdrops on the bandit algorithm and manipulates the feedback reward. The goal of the attacker is to make the bandit algorithm consider the target arm to be the optimal arm at the end. Specifically, we make the following contributions to data poisoning attacks on stochastic bandits.
1. We propose an optimization-based framework for offline attacks on bandit algorithms. Then, we instantiate three offline attack strategies against ε-greedy, the UCB algorithm and Thompson Sampling, which are the solutions of the corresponding convex optimization problems. That is, there exist efficient attack strategies for the attacker against these popular bandit algorithms.
2. We study the online attacks on bandit algorithms and propose an adaptive attack strategy that can hijack any bandit algorithm without knowing the bandit algorithm. As far as we know, this is the first negative result showing that there is no robust and good stochastic bandit algorithm that can survive online poisoning attacks.
3. We evaluate our attack strategies with numerical experiments. Our attack strategies are efficient in forcing the bandit algorithms to pull a target arm at a relatively small cost. Our results expose a significant security threat, as bandit algorithms are widely employed in real-world applications.
More recently, a line of work (Lykouris et al., 2018; Gupta et al., 2019) has studied an adversarial corruption model. While they assume that the attacker has to corrupt all the arms before the learner chooses an arm, we consider the case where the attacker's strategy is aware of and adaptive to the decision of the learner. This difference leads to opposite results: while they propose a robust bandit algorithm, our work shows that the existing algorithms are vulnerable to adversarial attacks.
This paper is organized as follows. In Section 2, we introduce the problem setting and the attack system models. Then, we propose the offline attack problem in Section 3. In Section 4, we study the online attack problem and propose an adaptive attack strategy. Numerical results evaluating our attack strategies are provided in Section 5. Finally, we conclude the paper with a discussion in Section 6.

2. Problem Formulation

We consider the classical stochastic bandit setting. Suppose that there is a set A = {1, 2, . . . , K} of K arms and the bandit algorithm proceeds in discrete time t = 1, 2, . . . , T. At each round t, the algorithm pulls an arm at ∈ A and the bandit environment generates a reward rt ∈ R such that
$$r_t = \mu_{a_t} + \eta_t, \qquad (1)$$
where ηt is a zero-mean, σ-subGaussian noise and µat is the expected reward of the pulled arm at time t. In other words, the expected reward of pulling arm a is µa. Note that {µa}a∈A are unknown to the bandit algorithm and the attacker.

The performance of the bandit algorithm is measured by the regret, which is the expected difference between the total reward obtained by an oracle that always pulls the optimal arm (i.e., the arm with the largest expected reward maxa∈A µa) and the cumulative reward collected by the bandit algorithm up to the time horizon T. Formally, the regret R(T) is given by
$$R(T) = \mathbb{E}\Big[\max_{a\in\mathcal{A}} \mu_a \, T - \sum_{t=1}^{T} r_t\Big]. \qquad (2)$$
In this work, we consider uniformly good bandit algorithms that incur sub-linear regret, i.e., R(T) = o(T).
For each reward rt returned from the bandit environment, the attacker manipulates the reward into
$$\tilde{r}_t = r_t + \epsilon_t. \qquad (3)$$
Then the accumulated attack cost of the attacker, C(T), is measured by a norm of the vector (ε1, . . . , εT)ᵀ. For simplicity, we consider the l2-norm in the offline setting and the l1-norm in the online setting. Note that the results in this work can be easily extended to any norm. For example, for the lp-norm, the total attack cost of the attacker is
$$C(T) = \Big(\sum_{t=1}^{T} |\epsilon_t|^p\Big)^{1/p}. \qquad (4)$$

[Figure 1: Offline attack system model. The bandit algorithm sequentially pulls arms at; the bandit environment writes the tuples (at, rt) into a data buffer of size T; the attacker poisons the buffered rewards by {εa} before the buffer is sent back to the bandit algorithm.]

[Figure 2: Online attack system model. At each round the attacker observes the pulled arm at, intercepts the reward rt from the bandit environment, and forwards the poisoned reward r̃t = rt + εt to the bandit algorithm.]

Without loss of generality, we assume that arm a∗ is a suboptimal attack target, such that µa∗ < maxa∈A µa. The attacker’s goal is to manipulate the bandit algorithm into pulling arm a∗ frequently. To avoid being detected, the attacker also wants to keep the cost as small as possible.
2.1. Offline attack system model
The offline attack system model is illustrated in Figure 1. Besides the bandit algorithm, the bandit environment and the attacker, there is a data buffer of size T. This models the setting where updates of the bandit algorithm happen in mini-batches of size T. For round t = 1, . . . , T, the bandit algorithm sequentially pulls arm at. Then the environment generates the reward rt according to Equation (1) and sends the tuple (at, rt) to the data buffer. The data buffer stores the data stream until round T. At the end of round T, the attacker accesses the data buffer and poisons the data by Equation (3). Finally, the data buffer sends the poisoned data stream {(at, r̃t)}t≤T to the bandit algorithm, and the bandit algorithm updates according to the received data without knowing the existence of the attacker.
The goal of the attacker in the offline setting is to force the bandit algorithm to pull the target arm a∗ with high probability at round T + 1 (i.e., after updating with the poisoned data) while incurring only a small cost. This means that the attacker wants to make the poisoning effort εt as small as possible in order to remain stealthy.
2.2. Online attack system model
The online attack system model is illustrated in Figure 2. In the online setting, the bandit algorithm updates at each time step. The attacker stealthily monitors the decision at of the bandit algorithm at each time t and poisons the reward signal returned from the bandit environment according to Equation (3). Then the bandit algorithm receives the manipulated reward signal r̃t and updates, unaware of the attacker.
The goal of the attacker in the online setting is to hijack the behavior of the bandit algorithm with high probability by manipulating the reward signal so that the bandit algorithm

pulls the target arm a∗ in Θ(T) time steps. In the meantime, the attacker wants to control its attack cost by poisoning as infrequently as possible in order to avoid being detected. Note that εt = 0 is considered as no attack.
By the linearity of expectation and the fact that ηt is a zero-mean noise,
$$R(T) = \sum_{a\in\mathcal{A}} \Big(\max_{i\in\mathcal{A}} \mu_i - \mu_a\Big)\, \mathbb{E}[N_a(T)], \qquad (5)$$
where Na(t) is the number of times arm a has been pulled up to time t. Thus, the attack goal means that the attacker wants the bandit algorithm to incur a linear regret while incurring only a sub-linear attack cost.

3. Offline Attacks
In this section, we introduce the offline attack framework for stochastic bandits. The updates of the bandit algorithm happen in mini-batches of size T, where the batch size T is a relatively large integer compared to K. Between these consecutive updates, the bandit algorithm follows a fixed policy obtained from the last update. This allows the attacker to poison the historical data before the update and force the algorithm to pull a target arm a∗ with high probability.
Let ma be the number of times arm a was pulled up to time T, i.e., ma := Na(T). For each arm a ∈ A, let ya ∈ R^{ma} be the corresponding reward vector returned from the bandit environment when arm a was pulled, that is, ya := (rt : at = a)ᵀ. Let εa ∈ R^{ma} be the poisoning attack strategy of the attacker, i.e., εa := (εt : at = a)ᵀ. The poisoned reward vector for arm a after the attack becomes ya + εa. To avoid being detected, the attacker hopes to make the poisoning εa as small as possible. Without loss of generality, we consider the l2-norm attack cost in the offline attacks. We have that

$$C(T)^2 = \sum_{t=1}^{T} \epsilon_t^2 = \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2. \qquad (6)$$

Thus, the attacker's offline attack problem can be formulated as the following optimization problem P:
$$P: \quad \min_{\{\epsilon_a : a\in\mathcal{A}\}} \; \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2 \qquad (7)$$
$$\text{s.t.} \quad \mathbb{P}\{a_{T+1} = a^*\} \ge 1 - \delta, \qquad (8)$$

for some error tolerance δ > 0. Note that we define the attack goal as forcing the bandit algorithm to pull arm a∗ at the next round with high probability. This is because some bandit algorithms, such as ε-greedy and Thompson Sampling, are randomized. It is not feasible to force a randomized algorithm to pull a target arm with probability 1, but it is possible to hijack the behavior of the randomized algorithm with high probability.
Proposition 1. Given some error tolerance δ > 0, if {ε∗a}a∈A is the optimal solution of problem P, then it is the optimal offline attack strategy for the attacker.

The proof of Proposition 1 follows from Equation (6) and the definition of offline attacks. Note that the constraint of problem P depends on the bandit algorithm. We now assume that the attacker knows the bandit algorithm and introduce algorithm-specific offline attack strategies derived from the optimization problem P for three popular bandit algorithms: ε-greedy, UCB and Thompson Sampling. For simplicity, we denote the post-attack empirical mean observed by the bandit algorithm at the end of round t as

$$\tilde{\mu}_a(t) := \frac{1}{N_a(t)} \sum_{\tau=1}^{t} \tilde{r}_\tau \, \mathbb{1}\{a_\tau = a\}. \qquad (9)$$

3.1. Offline attack on ε-greedy
Now, we derive the offline attack strategy for the ε-greedy algorithm, which is a randomized algorithm with some decreasing rate function αt. Typically, αt = Θ(1/t). At each time t, the ε-greedy algorithm pulls an arm
$$a_t = \begin{cases} \text{an arm drawn uniformly over } \mathcal{A}, & \text{w.p. } \alpha_t, \\ \arg\max_{a\in\mathcal{A}} \tilde{\mu}_a(t-1), & \text{otherwise.} \end{cases} \qquad (10)$$
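As a point of reference, the following is a minimal Python sketch of the selection rule (10), assuming the αt = 1/t rate function used in the experiments of Section 5 (the function name and arguments are illustrative):

```python
import numpy as np

def epsilon_greedy_pull(mu_tilde, t, rng):
    """One step of the epsilon-greedy rule in Eq. (10) with alpha_t = 1/t.

    mu_tilde: post-attack empirical means currently observed by the learner.
    """
    alpha_t = 1.0 / t                              # the Theta(1/t) rate function
    if rng.random() < alpha_t:
        return int(rng.integers(len(mu_tilde)))    # explore: uniform over the K arms
    return int(np.argmax(mu_tilde))                # exploit: largest empirical mean

# example: arm pulled at round t = 100 given the learner's current estimates
# arm = epsilon_greedy_pull(mu_tilde, 100, np.random.default_rng(0))
```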

At time step T + 1, the ε-greedy algorithm uniformly samples an arm from the arm set A with probability αT+1, which cannot be controlled by the attacker. Hence, we set the error tolerance as δ = (K − 1)/K · αT+1, since the target arm a∗ may also be pulled in the uniform sampling. The attacker therefore poisons the rewards such that the target arm a∗ has the largest empirical mean. After the attack, the ε-greedy algorithm pulls an arm aT+1 at time T + 1 such that P{aT+1 = a∗} = 1 − δ. In order to make the target arm the unique arm with the largest empirical mean, we introduce a margin parameter ξ > 0. So the attacker's optimization problem for attacking the ε-greedy algorithm is the following problem P1,
$$P_1: \quad \min_{\{\epsilon_a : a\in\mathcal{A}\}} \; \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2 \qquad (11)$$
$$\text{s.t.} \quad \tilde{\mu}_{a^*}(T) \ge \tilde{\mu}_a(T) + \xi, \quad \forall a \ne a^*, \qquad (12)$$
where $\tilde{\mu}_a(T) = (y_a + \epsilon_a)^\top \mathbf{1}/m_a$ and $\mathbf{1}$ is the all-ones vector. The constraint in the above optimization problem implies that the ε-greedy algorithm will play the target arm a∗ at time step T + 1 with probability at least 1 − (K − 1)/K · αT+1. The following result shows that there exists at least one optimal solution of problem P1, i.e., one optimal offline attack for the ε-greedy algorithm.
Theorem 1. Given any margin parameter ξ > 0, for any reward instance {ya}a∈A there exists at least one optimal solution of problem P1, which is a quadratic program with linear constraints. Hence, there exists at least one optimal offline attack for the ε-greedy algorithm.

The proof of Theorem 1 is provided in Section A.1 in the supplementary material. Note that the error tolerance parameter, δ = (K − 1)/K · αT+1, depends on the rate function αt of the ε-greedy algorithm, which is not controllable by the attacker. This accounts for the exploration introduced by the bandit algorithm's internal randomization, which cannot be manipulated by the attacker. However, the attacker can wait for a large enough T, since the rate function αt eventually goes to zero. Moreover, the attacker's strategy (problem P1) does not depend on the rate function, i.e., the attacker does not require knowledge of the parameter of the ε-greedy algorithm.
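Since P1 is a quadratic program with linear constraints, it can be handed to any off-the-shelf QP solver. The following is a minimal sketch (not the MATLAB code used in Section 5) using the cvxpy modeling package; the function name and data layout are illustrative assumptions:

```python
import cvxpy as cp  # assumed convex-optimization package; any QP solver works

def offline_attack_eps_greedy(y, target, xi=1e-3):
    """Solve problem P1: minimally perturb the logged rewards so that the target
    arm has the largest post-attack empirical mean, with margin xi.

    y: dict {arm: 1-D array of logged rewards for that arm}, i.e. the vectors y_a.
    Returns a dict {arm: epsilon_a} of poisoning vectors.
    """
    eps = {a: cp.Variable(len(r)) for a, r in y.items()}
    # post-attack empirical means: mu_tilde_a(T) = (y_a + eps_a)^T 1 / m_a
    mu_tilde = {a: cp.sum(y[a] + eps[a]) / len(y[a]) for a in y}
    constraints = [mu_tilde[target] >= mu_tilde[a] + xi for a in y if a != target]
    objective = cp.Minimize(sum(cp.sum_squares(eps[a]) for a in y))
    cp.Problem(objective, constraints).solve()
    return {a: eps[a].value for a in y}
```

Here y would be obtained by grouping the buffered rewards by the pulled arm.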

3.2. Offline attack on UCB
Now we derive the offline attack strategy for the classical UCB algorithm. For each time t, the UCB algorithm pulls the arm with the highest UCB index:

$$a_t = \arg\max_{a\in\mathcal{A}} \; u_a(t) := \tilde{\mu}_a(t-1) + 3\sigma\sqrt{\frac{\log t}{N_a(t-1)}}. \qquad (13)$$

The UCB algorithm pulls the target arm a∗ at time T + 1 if and only if the UCB index of arm a∗ is the unique largest one. Thus, the attacker can manipulate the rewards to satisfy this condition. Given any margin parameter ξ > 0, the attacker's optimization problem becomes
$$P_2: \quad \min_{\{\epsilon_a : a\in\mathcal{A}\}} \; \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2 \qquad (14)$$
$$\text{s.t.} \quad u_{a^*}(T+1) \ge u_a(T+1) + \xi, \quad \forall a \ne a^*.$$
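Expanding the definitions, since the arm counts Na(T) = ma are fixed by the logged data, the exploration bonuses in (13) are constants at round T + 1, and each constraint of P2 is linear in the poisoning variables:
$$\frac{(y_{a^*} + \epsilon_{a^*})^\top \mathbf{1}}{m_{a^*}} + 3\sigma\sqrt{\frac{\log(T+1)}{m_{a^*}}} \;\ge\; \frac{(y_a + \epsilon_a)^\top \mathbf{1}}{m_a} + 3\sigma\sqrt{\frac{\log(T+1)}{m_a}} + \xi, \qquad \forall a \ne a^*.$$
Hence P2, like P1, is a quadratic program with linear constraints.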

The condition in the above optimization implies that the UCB algorithm will pull the target arm after the poisoning attack. The following result shows that there exists at least one optimal solution of problem P2, i.e., one optimal offline attack for the UCB algorithm.


Theorem 2. Given any margin parameter ξ > 0, for any reward instance {ya}a∈A there exists at least one optimal solution of problem P2, which is a quadratic program with linear constraints. Hence, there exists at least one optimal offline attack for the UCB algorithm.
The proof of Theorem 2 is similar to the proof of Theorem 1. Note that the above attack strategy holds for any error tolerance δ since the UCB algorithm is not randomized.
3.3. Empirical mean based bandit algorithms
One insight from our offline attacks on the ε-greedy algorithm and the UCB algorithm is that empirical-mean-based algorithms are vulnerable to attack. This is because the constraints involving the empirical means are linear and non-empty, so the optimization problem P becomes a quadratic program with non-empty linear constraints, which can be solved efficiently. This result holds for many other variants of UCB algorithms and for Explore-Then-Commit algorithms (Garivier et al., 2016), which uniformly sample the arm set in the first (exploration) phase and commit to the arm with the largest empirical mean in the second (commitment) phase.
3.4. Beyond empirical means: Thompson Sampling for Gaussian distributions
We have shown that empirical-mean-based algorithms are not secure against the offline attack. It would appear that Bayesian algorithms should be more robust to the offline attack, as the constraint of problem P becomes intractable. Unfortunately, we find that Thompson Sampling, a popular Bayesian algorithm, is also vulnerable to data poisoning.
Now, we derive the attack strategy for Thompson Sampling for Gaussian distributions². In other words, the noise ηt is i.i.d. sampled from the Gaussian distribution N(0, σ²). Thompson Sampling for Gaussian distributions with the Jeffreys prior (Korda et al., 2013) works as follows. For each time t, the algorithm samples θa(t) from the posterior distribution N(µ̃a(t − 1)/σ², σ²/Na(t − 1)) for each arm a independently. Then the algorithm pulls the arm with the highest sample value, i.e., at = arg maxa∈A θa(t).
Let $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ be the probability density function (pdf) of the standard Gaussian distribution and Φ(x) be the corresponding cumulative distribution function (cdf). For simplicity, we denote by φa(x) the pdf of the Gaussian distribution N(µ̃a(T)/σ², σ²/Na(T)) for each arm a, and by Φa(x) the corresponding cdf. Since we are interested in the samples at time T + 1, we omit the time index in the
²Thompson Sampling needs a distribution model, and the Gaussian distribution is popular and well-studied in the literature. The idea of problem relaxation here can be extended to other distributions.

following discussion. By the law of total expectation, we have that
$$\mathbb{P}\{a_{T+1} = a^*\} = \mathbb{P}\{\cap_{a\ne a^*}\, \theta_{a^*} > \theta_a\} \qquad (15)$$
$$= \int_{-\infty}^{+\infty} \mathbb{P}\{\cap_{a\ne a^*}\, \theta_a < x \mid \theta_{a^*} = x\}\, \phi_{a^*}(x)\, dx$$
$$= \int_{-\infty}^{+\infty} \Big(\prod_{a\ne a^*} \mathbb{P}\{\theta_a < x \mid \theta_{a^*} = x\}\Big)\, \phi_{a^*}(x)\, dx$$
$$= \int_{-\infty}^{+\infty} \Big(\prod_{a\ne a^*} \Phi_a(x)\Big)\, \phi_{a^*}(x)\, dx. \qquad (16)$$

Equation (16) is complicated to compute and analyze, which makes the instantiation of the offline attack problem P non-trivial. To address this challenge, we derive a constraint that is sufficient for the original constraint and thereby obtain a relaxation of the original problem P. By the union bound, we have that

$$\mathbb{P}\{a_{T+1} \ne a^*\} = \mathbb{P}\{\cup_{a\ne a^*}\, \theta_{a^*} < \theta_a\} \le \sum_{a\ne a^*} \mathbb{P}\{\theta_{a^*} < \theta_a\} = \sum_{a\ne a^*} \Phi\left(\frac{\tilde{\mu}_a(T) - \tilde{\mu}_{a^*}(T)}{\sigma^3\sqrt{1/m_a + 1/m_{a^*}}}\right). \qquad (17)$$

Thus, we are ready to introduce the attacker’s problem for Thompson Sampling for Gaussian distributions,

$$P_3: \quad \min_{\{\epsilon_a : a\in\mathcal{A}\}} \; \sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2 \qquad (18)$$
$$\text{s.t.} \quad \sum_{a\ne a^*} \Phi\left(\frac{\tilde{\mu}_a(T) - \tilde{\mu}_{a^*}(T)}{\sigma^3\sqrt{1/m_a + 1/m_{a^*}}}\right) \le \delta, \qquad (19)$$
$$\tilde{\mu}_a(T) - \tilde{\mu}_{a^*}(T) \le 0, \quad \forall a \ne a^*. \qquad (20)$$

The constraint of problem P3 is a sufficient condition for the constraint of problem P by Equation (17). Note that the linear constraints (20) are redundant since Φ(0) = 0.5 and δ < 1/2 is usually satisfied, in which case constraint (19) already forces every term to be below Φ(0). The following proposition shows that the constraint set of problem P3 is convex.
Proposition 2. The constraint set formed by Equations (19) and (20) is convex.

The proof of Proposition 2 is provided in Section A.3 in the supplementary material. The following result shows that there exists at least one optimal solution of problem P3 that is a feasible offline attack for the Thompson Sampling for Gaussian distributions.
Theorem 3. Given any error tolerance δ > 0, for any reward instance {ya}a∈A there exists at least one optimal solution of problem P3, which is a quadratic program with convex constraints. Hence, there exists at least one feasible offline attack for the Thompson Sampling algorithm for Gaussian distributions.


By Proposition 2, the proof of Theorem 3 is similar to the proof of Theorem 1. Note that the above attack strategy is not the optimal attack strategy formulated by P. However, it is easy to compute since problem P3 is a quadratic program with convex constraints. Another relaxation of problem P is presented in Section A.4 in the supplementary material.

4. Online Attacks

In this section, we study online attacks against stochastic bandits. In the online attack setting, the bandit algorithm updates its policy at each round. The attacker eavesdrops on the decision (i.e., at) of the bandit algorithm and poisons the reward by adding an arbitrary εt ∈ R. Hence the reward observed by the bandit algorithm is r̃t = rt + εt. Without loss of generality, we consider the l1-norm attack cost, that is, the cost of the attacker in round t is |εt|. Recall that Na(t) is the number of times arm a has been pulled up to time t.
Without the attacks, the bandit algorithm is a uniformly good policy such that it achieves an O(log T) regret³, i.e., E[Σa (maxi µi − µa) Na(T)] = O(log T) for any problem instance {µa}a∈A. Moreover, the expected number of pulls of the optimal arm (the arm with the highest expected reward) is T − o(T).

The goal of the attacker is to force the bandit algorithm to pull the sub-optimal target arm a∗ as much as possible while paying the least possible total cost. Formally, the attacker wants the bandit algorithm to incur linear expected regret, i.e., E[Na∗(T)] = T − o(T), while paying an expected total cost $\mathbb{E}[\sum_{t=1}^{T} |\epsilon_t|] = O(\log T)$. In other words, the attacker wants to manipulate the rewards so that the bandit algorithm considers the target arm as the best arm in the long term.

4.1. Oracle constant attacks

The fact that the attacker does not know the expected rewards {µa}a∈A is what makes the problem challenging, because otherwise the attacker could perform the attack trivially. Suppose the attacker knows the expected rewards; then the attacker can choose the following oracle attack strategy,
$$\epsilon_t = -\mathbb{1}\{a_t \ne a^*\}\,[\mu_{a_t} - \mu_{a^*} + \xi]_+, \qquad (21)$$
for some margin parameter ξ > 0. Note that [·]+ denotes the function [x]+ = max{x, 0} and 1{·} is the indicator function. Under this attack, the bandit algorithm sees a poisoned bandit problem in which the target arm a∗ is the optimal arm and all the other arms are at least ξ below the target arm. Thus, the bandit algorithm pulls the target arm with E[Na∗(T)] = T − o(T) and pays a total cost $\mathbb{E}[\sum_{t=1}^{T}|\epsilon_t|] = O(\log T)$, since the εt are bounded. This result has been shown in (Jun et al., 2018) as follows.

³The results in this section directly apply to problem-independent regret bounds and high-probability bounds.

Proposition 3. (Proposition 1 in (Jun et al., 2018)) Assume that the bandit algorithm achieves an O(log T) regret bound. Then the oracle attack with ξ > 0 succeeds, i.e., E[Na∗(T)] = T − o(T). Furthermore, the expected attack cost is $O\big(\sum_{i\ne a^*} [\mu_i - \mu_{a^*} + \xi]_+ \log T\big)$.
The insight obtained from Proposition 3 is that the attacker does not need to attack in round t if at = a∗. This helps us design attack strategies that are similar to the oracle attack. When the ground truth {µa}a∈A is not known to the attacker, the attacker may guess some upper bounds {Ca}a≠a∗ on {[µa − µa∗]+}a≠a∗ and perform the following oracle constant attack,
$$\epsilon_t = -\mathbb{1}\{a_t \ne a^*\}\, C_{a_t}. \qquad (22)$$
The following result shows the sufficient and necessary conditions for the oracle constant attack to be successful.
Proposition 4. Assume that the bandit algorithm achieves an O(log T) regret bound. Then the constant attack with {Ca}a≠a∗ succeeds if and only if Ca > [µa − µa∗]+, ∀a ≠ a∗. If the attack succeeds, then the expected attack cost is $O\big(\sum_{a\ne a^*} C_a \log T\big)$.
The proof of Proposition 4 is provided in Section B.1 in the supplementary material. Proposition 4 shows that the attacker has to know the unknown bounds on {[µa − µa∗]+}a≠a∗ to guarantee a successful constant attack. Moreover, the oracle constant attack is non-adaptive to the problem instance, since some upper bound Ca can be much larger than the quantity [µa − µa∗]+, so the attacker pays unnecessary attack cost compared to the oracle attack. To address this challenge, we propose an adaptive constant attack that learns the bandit environment in an online fashion and performs the attack adaptively.

4.2. Adaptive attack by constant estimation

Now, we are ready to propose the adaptive attack strategy.
The idea is that the attacker can maintain upper bounds on the unknown quantities {[µa − µa∗]+}a≠a∗ based on the empirical means observed by the attacker. Then the attacker
performs the constant attack based on the estimated upper bounds. Note that the attacker observes the tuple (at, rt) at each time t and is able to obtain an unbiased empirical mean. Let µˆa(t) be the pre-attack empirical mean observed by the attacker at time t, that is

$$\hat{\mu}_a(t) := \frac{1}{N_a(t)} \sum_{\tau=1}^{t} r_\tau \, \mathbb{1}\{a_\tau = a\}. \qquad (23)$$

Given any δ ∈ (0, 1), we define a function β(n) as
$$\beta(n) = \sqrt{\frac{2\sigma^2}{n} \log\frac{\pi^2 K n^2}{3\delta}}. \qquad (24)$$




Then, the following result shows that {β(Na(t))}a∈A provides high-probability bounds on the deviation of the pre-attack empirical means from the true means, for all arms and all rounds. Define the event
$$E = \{\forall a \in \mathcal{A}, \forall t : |\hat{\mu}_a(t) - \mu_a| < \beta(N_a(t))\}. \qquad (25)$$

Lemma 1. (Lemma 1 in (Jun et al., 2018)) For δ ∈ (0, 1), P(E) > 1 − δ.

The proof of Lemma 1 follows from Hoeffding's inequality and a union bound. Lemma 1 shows that, with probability at least 1 − δ, the expected reward µa lies in the interval (µ̂a(t) − β(Na(t)), µ̂a(t) + β(Na(t))) centered at the pre-attack empirical mean µ̂a(t). Thus, we obtain an upper bound on the quantity µa − µa∗ and propose the following Adaptive attack by Constant Estimation (ACE),
$$\epsilon_t = -\mathbb{1}\{a_t \ne a^*\}\,\big[\hat{\mu}_{a_t}(t) - \hat{\mu}_{a^*}(t) + \beta(N_{a_t}(t)) + \beta(N_{a^*}(t))\big]_+.$$
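To make the strategy concrete, here is a minimal sketch of the ACE attacker's per-round logic (a hypothetical ACEAttacker class; the bandit algorithm on the other side of the interaction is arbitrary and is not shown):

```python
import numpy as np

class ACEAttacker:
    """Adaptive attack by Constant Estimation (ACE): a sketch of the attacker that
    sits between the bandit algorithm and the environment (Figure 2)."""

    def __init__(self, num_arms, target, sigma=0.1, delta=0.05):
        self.K, self.target = num_arms, target
        self.sigma, self.delta = sigma, delta
        self.counts = np.zeros(num_arms)          # N_a(t), maintained by the attacker
        self.sums = np.zeros(num_arms)            # running sums of pre-attack rewards

    def _beta(self, n):
        # high-probability confidence radius beta(n) as reconstructed in Eq. (24)
        return np.sqrt(2 * self.sigma**2 / n
                       * np.log(np.pi**2 * self.K * n**2 / (3 * self.delta)))

    def attack(self, arm, reward):
        """Observe (a_t, r_t) and return the poisoned reward r~_t = r_t + eps_t."""
        self.counts[arm] += 1
        self.sums[arm] += reward
        if arm == self.target or self.counts[self.target] == 0:
            return reward   # never attack the target arm (sketch: also wait until it has been pulled once)
        mu_hat = self.sums / np.maximum(self.counts, 1)
        gap = (mu_hat[arm] - mu_hat[self.target]
               + self._beta(self.counts[arm]) + self._beta(self.counts[self.target]))
        eps = -max(gap, 0.0)                      # eps_t from the ACE rule
        return reward + eps
```

At each round, the pulled arm at and the pre-attack reward rt are passed through attack(), and the bandit algorithm only ever observes the returned value r̃t.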

The following shows that the ACE is successful for any bandit algorithm without the knowledge of the bandit algorithm.
Theorem 4. Given any δ ∈ (0, 0.5), assume that the bandit algorithm achieves an O(log T) regret bound with probability at least 1 − δ. With probability at least 1 − 2δ, the ACE attacker forces the bandit algorithm to play the target arm a∗ Na∗(T) times, with Na∗(T) = T − o(T), using an instantaneous attack cost
$$|\epsilon_t| \le \mathbb{1}\{a_t \ne a^*\}\,\big([\mu_{a_t} - \mu_{a^*}]_+ + 2\beta(N_{a_t}(t)) + 2\beta(N_{a^*}(t))\big).$$
In particular, with probability at least 1 − 2δ, the ACE attack is successful and the accumulated attack cost is
$$\sum_{t=1}^{T} |\epsilon_t| \le O\Big(\sum_{a\ne a^*} \big([\mu_a - \mu_{a^*}]_+ + 4\beta(1)\big) \log T\Big).$$

The proof of Theorem 4 is provided in Section B.2 in the supplementary material. Theorem 4 shows that the ACE is universally successful against any bandit algorithm, without knowing any prior information on {µa}a∈A. Besides, the ACE incurs a high-probability accumulated attack cost as small as that of the oracle attack (high-probability bounds for the oracle attack can be adapted from Proposition 3), with only an additional bounded additive term, O(4β(1)K log T). That is, the ACE is close to the hindsight-oracle attack strategy. Moreover, the ACE even requires no knowledge of the bandit algorithm. This is an advantage over the algorithm-dependent online attack strategies designed in (Jun et al., 2018), since the attacker may not know which bandit algorithm the learner uses in practice. As far as we know, this is the first negative result showing that there is no robust bandit algorithm that can be immune to the adaptive online attack. This exposes a significant security threat to bandit learning systems.

5. Numerical Results
In this section, we run simulations on attacking the ε-greedy, UCB and Thompson Sampling algorithms to illustrate our theoretical results. All the simulations are run in MATLAB and the code can be found in the supplementary materials. Note that we implement ε-greedy with αt = 1/t.
5.1. Offline attacks
To study the effectiveness of the offline attacks, we consider the following experiment. The bandit has K = 5 arms and the reward noise is Gaussian, N(0, σ²), with σ = 0.1. The attacker's target arm is arm K such that µK = 0, while the expected rewards of the other arms are uniformly distributed in the unit interval [0, 1]. We set T = 1000 and the error tolerance to δ = 0.05.
In each attack trial, we first generate the bandit instance by drawing the expected rewards from the uniform distribution on [0, 1]. Then we run the bandit algorithm for T rounds and collect all the historical data. Without any attack, the bandit algorithm would have converged to some optimal arm in the trial, which is not the target arm since the target arm has the smallest expected reward. Then we set the margin parameter to ξ = 0.001 and run the corresponding offline attacks. The attack strategy is the solution of the optimization problem Pi. Since all the problems are quadratic programs with linear (convex) constraints, they can be solved by standard optimization tools.
We run 1000 attack trials. Note that the attack cost depends on the bandit instance. To evaluate the attack cost, we use the poisoning effort ratio (Ma et al., 2018):
$$\frac{\|\epsilon\|_2}{\|y\|_2} = \frac{\sqrt{\sum_{a\in\mathcal{A}} \|\epsilon_a\|_2^2}}{\sqrt{\sum_{a\in\mathcal{A}} \|y_a\|_2^2}}. \qquad (26)$$
The poisoning effort ratio measures the ratio of the total cost over the norm of the original rewards. To evaluate the attack effectiveness, we check whether the poisoned data satisfy the constraint of the optimization problem P. If so, the bandit algorithm will play the target arm with probability at least 1 − δ.
Figure 3 shows the results of the attacks against ε-greedy, UCB and Thompson Sampling. First, the attack strategies are effective, as all the attacks are successful. Second, the attack costs are small, as shown in the histograms. The poisoning effort ratios for attacking ε-greedy, UCB and Thompson Sampling are less than 10%, 2% and 5%, respectively.

[Figure 3: Histograms of the poisoning effort ratio in the offline attacks. Panels: (a) attack on the ε-greedy algorithm, (b) attack on the UCB algorithm, (c) attack on Thompson Sampling. In all three panels the success rate is 100%.]

[Figure 4: Attack cost and target arm pulls in the online attacks. Panels: (a) attack on the ε-greedy algorithm, (b) attack on the UCB algorithm, (c) attack on Thompson Sampling. Each panel plots the attack cost over time for ACE and, where available, the attack of (Jun et al., 2018) at ∆ ∈ {1, 0.5, 0.2}, together with the number of target arm pulls over time with and without the attack.]

5.2. Online attacks
We compare our adaptive attack strategy with the two attack strategies proposed by (Jun et al., 2018). Note that these two attacks are against the ε-greedy and UCB algorithms and require knowledge of the algorithm. In contrast, we highlight that our attack strategy ACE is a universal attack strategy against any bandit algorithm.
We consider the following experiment. The bandit has two arms. The expected rewards of arms 1 and 2 are µ1 = ∆ and µ2 = 0 with ∆ > 0. The attacker's target is arm 2. The reward noise is i.i.d. Gaussian, N(0, σ²), with σ = 0.1. We set the error tolerance to δ = 0.05 and the time horizon to T = 10⁵ rounds. For the implementations of the attack strategies proposed by (Jun et al., 2018), we choose the tuning parameter ∆0 = σ, as suggested by (Jun et al., 2018) when T is not known to the attacker. We run 100 attack trials with different ∆ values.
Figure 4 shows the average attack cost and the number of target arm pulls in the online attacks. Note that the target arm pull curves are shown for the case ∆ = 1. First, we compare the number of target arm pulls with and without the ACE attack: the ACE attack forces the bandit algorithm to pull the target arm linearly in time. Second, the attack costs are relatively small compared to the loss of the bandit algorithm, which is linear in time. Generally, the attack costs of the ACE attack are bounded by O(log T) and increase as the reward gap ∆ becomes larger. These observations verify the result of Theorem 4. On the other hand, the attack costs on Thompson Sampling and ε-greedy are relatively smaller than that on UCB. This is because Thompson Sampling and ε-greedy converge to the "optimal" arm very fast, while UCB's exploration of the "non-optimal" arms may still increase over time. Finally, compared


to the algorithm-specific attacks proposed by (Jun et al., 2018), the attack cost of ACE on ε-greedy is slightly worse, while the attack cost of ACE on UCB is much larger than that of Jun's attack. In the case of attacking the UCB algorithm, our universal attack strategy incurs more cost than the algorithm-specific attack, but without the need to know the algorithm.
6. Conclusion
In this work, we study the open problem of data poisoning attacks on bandit algorithms. We propose an offline attack framework for stochastic bandits and propose three algorithm-specific offline attack strategies against ε-greedy, UCB and Thompson Sampling. Then, we study an online attack against bandit algorithms and propose an adaptive attack strategy that can hijack the behavior of any bandit algorithm without requiring knowledge of the bandit algorithm. Our theoretical results and simulations show that bandit algorithms are vulnerable to poisoning attacks in both the online and offline settings.
Acknowledgements
This work has been supported in part by a grant from DTRA (HDTRA1-14-1-0058), a grant from the Office of Naval Research (N00014-17-1-2417), grants from the National Science Foundation (CNS-1446582, CNS-1518829, CNS-1409336) and a grant from the Army Research Office (WN11NF-15-10277). This work has also been supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2017-000692, Transport-aware Streaming Technique Enabling Ultra Low-Latency AR/VR Services).

References
Alfeld, S., Zhu, X., and Barford, P. Data poisoning attacks against autoregressive models. In AAAI, pp. 1452–1458, 2016.
Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, pp. 1467–1474, 2012.
Chapelle, O., Manavoglu, E., and Rosales, R. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2015.
Garivier, A., Lattimore, T., and Kaufmann, E. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, pp. 784–792, 2016.
Goodfellow, I., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6572.
Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. arXiv preprint arXiv:1902.08647, 2019.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and Abbeel, P. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
Jun, K.-S., Li, L., Ma, Y., and Zhu, X. Adversarial attacks on stochastic bandits. arXiv preprint arXiv:1810.12188, 2018.
Korda, N., Kaufmann, E., and Munos, R. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pp. 1448–1456, 2013.
Li, B., Wang, Y., Singh, A., and Vorobeychik, Y. Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pp. 1885–1893, 2016.
Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-Y., and Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. arXiv preprint arXiv:1703.06748, 2017.
Liu, F., Wang, S., Buccapatnam, S., and Shroff, N. UCBoost: a boosting approach to tame complexity and optimality for stochastic bandits. arXiv preprint arXiv:1804.05929, 2018.
Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 114–122. ACM, 2018.
Ma, Y., Jun, K.-S., Li, L., and Zhu, X. Data poisoning attacks in contextual bandits. arXiv preprint arXiv:1808.05760, 2018.
Mei, S. and Zhu, X. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pp. 2871–2877, 2015.
Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Wang, Y. and Chaudhuri, K. Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994, 2018.
Xiao, H., Biggio, B., Brown, G., Fumera, G., Eckert, C., and Roli, F. Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pp. 1689–1698, 2015.
