OpenAI on AI Safety: Attacking Machine Learning with Adversarial Examples

This article is republished with permission from 机器之心 (Synced; WeChat public account: aimosthuman2014). Further reposting is prohibited. Translators: 微胖, 李亚洲, 吴攀 @ 机器之心.

Adversarial examples are inputs to machine learning models that an attacker has intentionally crafted to cause the model to make a mistake; they are like optical illusions for machines. In this post we show how adversarial examples work across different media, and discuss why it is so hard to defend systems against them.

At OpenAI, we think adversarial examples are a good aspect of the AI safety research we are working on, because they represent a concrete problem that can be tackled in the near term, and because fixing them is difficult enough to require serious scientific research. (That said, to build safe, widely distributed AI systems we will need to study many aspects of machine learning security.)

To see what adversarial examples really look like, consider this demonstration from Explaining and Harnessing Adversarial Examples: starting from an image of a panda, the attacker adds a small perturbation that causes the image to be classified as a gibbon.

An adversarial input, overlaid on a typical image, causes the classifier to hallucinate and misclassify a panda as a gibbon.
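For concreteness, here is a minimal sketch of the fast gradient sign method (FGSM) used in that paper, assuming TensorFlow 2.x, a pretrained Keras classifier model that outputs logits, and a correctly labeled input batch; the names are illustrative rather than taken from any particular codebase:

# Minimal FGSM sketch: perturb the input in the direction that increases the loss.
import tensorflow as tf

def fgsm_perturb(model, image, label, eps=0.007):
    """Return image + eps * sign(gradient of the loss w.r.t. the image)."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)                     # track gradients w.r.t. the pixels
        loss = loss_fn(label, model(image))
    gradient = tape.gradient(loss, image)     # d(loss)/d(pixels)
    adversarial = image + eps * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)

The eps value controls how large the perturbation is; in the published panda example it is small enough that the change is invisible to a human.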

The approach is quite robust: recent research has shown that adversarial examples printed on ordinary paper and then photographed with a standard-resolution smartphone can still fool systems.

Adversarial examples can be printed on paper and photographed with a standard-resolution phone and still fool the classifier; in this case, the classifier labels a "washer" as a "safe".

Adversarial examples are potentially dangerous. For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial "stop" sign that the vehicle would interpret as a "yield" or other sign, with dangerous consequences. This scenario is discussed in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples.

Recent work, such as Adversarial Attacks on Neural Network Policies (a collaboration between Berkeley, OpenAI, and Penn) and Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks (University of Nevada), shows that reinforcement learning agents can also be manipulated by adversarial examples. The research demonstrates that widely used reinforcement learning algorithms such as DQN, TRPO, and A3C are all vulnerable: adversarial inputs degrade the agents' performance even when the perturbations are too subtle for a human to notice, causing an agent to move its paddle down when it should move up, or interfering with its ability to spot enemies in Seaquest.

If you would like to try breaking your own models, take a look at cleverhans, an open-source library developed by Ian Goodfellow and Nicolas Papernot for benchmarking how vulnerable your AI models are to adversarial examples.

Adversarial examples give us some traction on AI safety

When we think about AI safety, we usually think about the hardest problems in the field: how can we ensure that sophisticated reinforcement learning agents, far more intelligent than humans, behave as their designers intended?

Adversarial examples show us that even simple modern algorithms, in both supervised and reinforcement learning, can already behave in surprising ways their designers did not intend.

Attempted defenses against adversarial examples

Traditional techniques for making machine learning models more robust, such as weight decay and dropout, generally do not provide a practical defense against adversarial examples. So far, only two methods have provided a significant defense.

Adversarial training: This is a brute-force solution in which we simply generate many adversarial examples and explicitly train the model not to be fooled by them. An open-source implementation of adversarial training is available in the cleverhans library, with a walkthrough in this tutorial (https://github.com/openai/cleverhans/blob/master/tutorials/mnist_tutorial_tf.md).
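A compressed sketch of that brute-force idea (not the cleverhans implementation itself), reusing the fgsm_perturb helper above and assuming a Keras model with logit outputs, an optimizer, and a labeled batch:

# Adversarial training sketch: mix clean and FGSM-perturbed examples in each step.
import tensorflow as tf

def adversarial_train_step(model, optimizer, images, labels, eps=0.007):
    adv_images = fgsm_perturb(model, images, labels, eps)
    x = tf.concat([images, adv_images], axis=0)   # clean + adversarial batch
    y = tf.concat([labels, labels], axis=0)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss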

Defensive distillation (https://arxiv.org/abs/1511.04508): In this strategy, we train the model to output probabilities of different classes rather than hard decisions about which class the input belongs to. The probabilities are supplied by an earlier model, trained on the same task using hard class labels. This yields a model whose surface is smoothed in the directions an adversary would typically try to exploit, making it harder for the adversary to discover input tweaks that lead to misclassification. (Distilling the Knowledge in a Neural Network (https://arxiv.org/abs/1503.02531) originally introduced distillation as a model-compression technique, in which a small model is trained to imitate a large one in order to save computation.)
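A rough sketch of that two-stage training, assuming a build_model() factory that returns the same Keras architecture with logit outputs and a labeled training set (x_train, y_train); the temperature is an illustrative value:

# Defensive distillation sketch: train a teacher on hard labels, then train a
# second model of the same architecture on the teacher's softened probabilities.
import tensorflow as tf

T = 20.0  # distillation temperature (illustrative)

teacher = build_model()  # outputs logits
teacher.compile(optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
teacher.fit(x_train, y_train, epochs=5)

# Soft labels: teacher logits scaled down by the temperature, then softmaxed.
soft_labels = tf.nn.softmax(teacher.predict(x_train) / T)

student = build_model()
student.compile(optimizer="adam",
                loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))
# (The original paper also divides the student's logits by T during training;
# that detail is omitted here for brevity.)
student.fit(x_train, soft_labels, epochs=5)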

Yet even these specialized algorithms can easily be broken again by an adversary willing to spend a little more computational firepower on the attack.

A failed defense: "gradient masking"

To give an example of how a simple defense can fail, let's consider why a technique called "gradient masking" does not work.

"Gradient masking" is a term introduced in the 2016 paper Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples to describe an entire class of failed defense methods that try to work by denying the attacker access to a useful gradient.

Most techniques for constructing adversarial examples use the model's gradient to mount the attack. In other words, they look at a picture of an airplane, test which direction in image space makes the probability of the "cat" class increase, and then give the input a little push (in other words, they perturb it) in that direction. The new, modified image is then misrecognized as a cat.

But what if there were no gradient, so that an infinitesimally small modification to the image caused no change at all in the model's output? That seems to provide some defense, because the attacker no longer knows which direction to "push" the image in.

It is easy to imagine some very simple ways of getting rid of the gradient. For example, most image classification models can be run in two modes: one where they output only the identity of the most likely class, and one where they output probabilities. If a model's output is "99.9% airplane, 0.1% cat", then a tiny change to the input produces a tiny change to the output, and the gradient tells us which changes will increase the probability of "cat". If we run the model in a mode where it outputs only "airplane" with no probabilities, then a small change to the input does not change the output at all, and the gradient tells us nothing.

Let's run a thought experiment to see how well we could defend against adversarial examples by running our model in "most likely class" mode rather than "probability" mode. The attacker no longer knows where to look for inputs that will be classified as cats, so we might seem to have gained some defense. Unfortunately, every image that was classified as a cat before is still classified as a cat now. If the attacker can guess which points are adversarial examples, those points will still be misclassified. So the defense does not make the model more robust; it just gives the attacker fewer clues about where the holes in the model's defenses are.

Even more unfortunately, it turns out that attackers have a very good strategy for guessing where those holes are. The attacker can train their own model, a smooth model that has a gradient, craft adversarial examples for that model, and then simply deploy those adversarial examples against our non-smooth model. Very often, our model will misclassify them too. In the end, our thought experiment shows that hiding the gradient got us nowhere.

Defense strategies that perform gradient masking typically result in a model that is very smooth in specific directions and in the neighborhoods of the training points, which makes it harder for the adversary to find gradients that indicate good candidate directions for perturbing the model's input in a damaging way. However, the adversary can train a "substitute" model: a copy that imitates the defended model, obtained by observing the labels the defended model assigns to inputs the adversary has chosen carefully.

The black-box attack paper describes a procedure for carrying out such a model extraction attack. The adversary can then use the substitute model's gradients to find adversarial examples that are also misclassified by the defended model. In the figure above (taken from the discussion of gradient masking in Towards the Science of Security and Privacy in Machine Learning), this attack strategy is illustrated on a one-dimensional machine learning problem; the gradient masking phenomenon would be worse for higher-dimensional problems, but is harder to depict.
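As a sketch of that substitute-model idea, assume the defended model is reachable only through a label-returning function protected_predict, and reuse the build_model and fgsm_perturb helpers from the sketches above (all of these names are assumptions, not part of any published implementation):

import tensorflow as tf

# 1. Query the defended model (label-only access) on attacker-chosen inputs.
stolen_labels = protected_predict(x_probe)   # returns class indices, no probabilities

# 2. Train a smooth, differentiable substitute on the stolen labels.
substitute = build_model()
substitute.compile(optimizer="adam",
                   loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
substitute.fit(x_probe, stolen_labels, epochs=5)

# 3. Craft adversarial examples against the substitute and replay them against
#    the defended model; hiding the defended model's gradient does not stop this.
x_adv = fgsm_perturb(substitute, x_probe, stolen_labels, eps=0.03)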

We find that both adversarial training and defensive distillation accidentally perform a kind of gradient masking. Neither algorithm was explicitly designed to do so, but gradient masking is apparently a defense that machine learning algorithms can invent relatively easily when they are trained to defend themselves and are not given specific instructions about how to do it. If we transfer adversarial examples from one model to a second model trained with either adversarial training or defensive distillation, the attack often still succeeds, even when a direct attack on the second model would fail. This suggests that both techniques do more to flatten the model and remove the gradient than to make sure it classifies more points correctly.

Why is it hard to defend against adversarial examples?

Adversarial examples are hard to defend against because it is difficult to construct a theoretical model of the adversarial example crafting process. Adversarial examples are solutions to an optimization problem that is non-linear and non-convex for many machine learning models, including neural networks. Because we do not have good theoretical tools for describing the solutions to these complicated optimization problems, it is very hard to make any kind of theoretical argument that a defense rules out a set of adversarial examples.

Adversarial examples are also hard to defend against because they require machine learning models to produce good outputs for every possible input. Most of the time, machine learning models work very well, but they may fail on a small fraction of the many possible inputs they could encounter.

Every strategy we have tested so far fails because it is not adaptive: it can block one kind of attack, but it leaves other vulnerabilities open to an attacker who knows the defense being used. Designing a defense that can protect against a powerful, adaptive attacker is an important research area.

Conclusion

Adversarial examples show that many modern machine learning algorithms can be broken in surprising ways. These failures of machine learning demonstrate that even simple algorithms can behave very differently from what their designers intended. We encourage machine learning researchers to get involved and design methods for preventing adversarial examples, in order to close the gap between what designers intend and how their algorithms behave.

Original article: https://openai.com/blog/adversarial-example-research/

Chinese translation: http://mp.weixin.qq.com/s/QVxf4cgutpHrE6wTBa6vQg

Applying Machine Learning to Improve Your Intrusion Detection System

Whether we realize it or not, machine learning touches our daily lives in many ways. When you upload a picture on social media, for example, you might be prompted to tag other people in the photo. That’s called image recognition, a machine learning capability by which the computer learns to identify facial features. Other examples include number and voice recognition applications.

From an intrusion detection perspective, analysts can apply machine learning, data mining and pattern recognition algorithms to distinguish between normal and malicious traffic.

Boosting Intrusion Detection With Machine Learning

One way that a computer can learn is by examples. For instance, a computer can learn to recognize a specific object, such as a car:

[Image: a red car]

The computer can extract features from the car such as its color — in this case, red. If we classify the object by its color, we can model it as follows:

Object ID           Color   Class
[image of object]   Red     Car
[image of object]   Blue    Not car
[image of object]   Red     Car

The algorithm then generates the following learning/classifying/decision tree:

[Image: the learned decision tree]

After the computer learns the above, you can ask it to classify the following object:

[Image: a red rose]

The computer will classify the rose as a car because it is also red. We need to extract more valuable and discriminative features, such as shape, to help the computer differentiate the car from any other red object.
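A toy version of this color-only classifier makes the failure easy to reproduce; scikit-learn is an assumed stand-in here, since the article does not name a library for this step:

# Learn "car vs. not car" from color alone, then ask about a red rose.
from sklearn.tree import DecisionTreeClassifier

# Feature: color encoded as red=0, blue=1. Label: 1 = car, 0 = not car.
X = [[0], [1], [0]]
y = [1, 0, 1]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[0]]))  # a red rose is also color=0, so the tree answers "car"

Adding a shape feature as a second column of X is exactly the kind of extra discriminative information that would let the tree separate the rose from the cars.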

The Need for Intelligent IDS

An intrusion detection system (IDS) monitors the network traffic looking for suspicious activity, which could represent an attack or unauthorized access. Traditional systems were designed to detect known attacks but cannot identify unknown threats. They most commonly detect known threats based on defined rules or behavioral analysis through baselining the network.

A sophisticated attacker can bypass these techniques, so the need for more intelligent intrusion detection is increasing by the day. Researchers are attempting to apply machine learning techniques to this area of cybersecurity.

The foundation of any intelligent IDS is a robust data set to provide examples from which the computer can learn. Today, however, very little security data is publicly available. That’s why I conducted an experiment in which I created a small, new data set with discernible features that can help analysts train computers to detect the most serious threats, even zero-day attacks.

Network Traffic Analysis

Network traffic can be analyzed at the packet, connection or session level. In general, the connection represents a bidirectional flow and the session represents multiple connections between the same source and destination.

In my prototype system, I used the powerful network analysis platform Bro to analyze traffic at the connection level. Bro can monitor Transmission Control Protocol (TCP), User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP), and write the analyzed traffic to well-structured, tab-separated files suitable for post-processing. The platform interprets UDP and ICMP connections using flow semantics.

Feature Extraction

Bro writes several log files about network traffic. The conn.log file, for example, contains generic information about each connection, such as the time stamp, connection ID, source IP, source port, destination IP and destination port. This information is not enough. To extract more features from the network traffic, we need to create features and attributes to help us distinguish between normal and harmful traffic.
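As a sketch of that post-processing step, the tab-separated conn.log can be loaded with pandas; the tooling is an assumption, and the column names below follow Bro's default conn.log layout, so they should be checked against the #fields header of your own log:

import pandas as pd

COLUMNS = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
           "proto", "service", "duration", "orig_bytes", "resp_bytes",
           "conn_state"]

# Bro writes '#'-prefixed header lines and uses '-' for missing values.
conn = pd.read_csv("conn.log", sep="\t", comment="#",
                   names=COLUMNS, usecols=range(len(COLUMNS)), na_values="-")
print(conn[["uid", "proto", "service", "orig_bytes"]].head())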

The challenge is to stick with generic features: it is not practical to extract features for each application-layer protocol, since there are thousands of them. In his paper, "Machine Learning for Application-Layer Intrusion Detection," researcher Konrad Rieck explained the benefits of selecting generic features, such as those shown below:

[Image: generic features suggested by K. Rieck]

This is a great start, but we still need more features to help the machine recognize attacks. To add more depth to the analysis, we should determine whether the payload contains:

  • Shellcode;
  • JavaScript code;
  • SQL command or SQL injection queries; and
  • Command injection.

Those features can help the machine detect zero-day and web application attacks. To extract all the features, I limit the extraction process to the data sent by the source of the connection.
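A rough sketch of those payload checks, together with two of the generic features used later (entropy and non-printable character count), assuming the bytes sent by the connection originator are available as payload; the regular expressions are illustrative rather than complete detectors, and shellcode, as the next paragraph explains, needs heavier tooling:

import math
import re
from collections import Counter

SQL_RE = re.compile(rb"(union\s+select|or\s+\d=\d|information_schema)", re.I)
JS_RE  = re.compile(rb"(<script|document\.cookie|eval\()", re.I)
CMD_RE = re.compile(rb"(;\s*(cat|ls|wget|nc)\b|\|\s*bash)", re.I)

def entropy(data):
    """Shannon entropy of the payload bytes."""
    counts = Counter(data)
    total = len(data) or 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def payload_features(payload):
    return {
        "entropy": entropy(payload),
        "nonprintable": sum(b < 32 or b > 126 for b in payload),
        "has_sql": bool(SQL_RE.search(payload)),
        "has_js": bool(JS_RE.search(payload)),
        "has_cmd_injection": bool(CMD_RE.search(payload)),
    }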

Most features can be extracted using a regular expression or calculated directly from the connection content. Shellcode is a notable exception, because attackers can encrypt, compress or encode it. I used Libemu, an x86 emulation and shellcode detection library, which works well but still can't detect unencrypted shellcodes. To solve this problem, at the suggestion of Dr. Ali Hadi, I used the malware analysis platform Cuckoo Sandbox. Hadi suggested extracting more features from the traffic, such as the sequence of application program interfaces (APIs).

Both features are important for detecting shellcode and malware. By running the whole payload as a sequence of instructions in Cuckoo Sandbox, I can determine whether it represents an attack based on whether it makes Windows Sockets 2 (Winsock) API calls.

Creating Useful Data Sets

So we’ve captured and analyzed the network traffic. How do we label it as normal or malicious traffic?

For my experiment, I installed Ubuntu to be used as a target machine, as well as the Damn Vulnerable Web Application (DVWA), a dummy application designed to help security professionals test their cyberdefense skills. I launched several attacks against the DVWA from a different computer, then used Bro to analyze the traffic between the two machines. I also configured Bro to extract the connection content as binary files.

I launched several types of attacks, such as SQL injection, command injection and cross-site scripting (XSS), against the vulnerable web application on the target machine. To conduct an SQL injection from the attacking machine, for example, open the target web app, navigate to the SQL injection tab and write the following in the text field:

(%' or 0=0 union select null, table_name from information_schema.tables #)

If the web app is vulnerable, the result will look like this:

[Image: DVWA SQL injection example output]

Bro then outputs several log files, including conn.log, which contains general information about each network connection. Each row in the image below represents a connection.

[Image: Bro conn.log file]

I also configured Bro to extract the content of the connection in a separate file as I performed the attacks. This way, I know what attack data was sent to the vulnerable web application. To classify the connections, I used a hex dump to see each connection content file:

[Image: hex dump of a connection's content]

Based on the content, I classified each connection by its corresponding attack type, inspecting each connection content file with the hex dump tool to find the exact attack traffic. We can see that the user sent the following in a GET request:

%25%27+or+0%3D0+union+select+null%2C+table_name+from+information_schema.tables+%23

After decoding the request, you will see the following:

%’ or 0=0 union select null, table_name from information_schema.tables #
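That decoding step is a one-liner; for example, with Python's urllib (an assumed tool choice, since the article does not say how the request was decoded):

from urllib.parse import unquote_plus

encoded = ("%25%27+or+0%3D0+union+select+null%2C+table_name"
           "+from+information_schema.tables+%23")
print(unquote_plus(encoded))
# %' or 0=0 union select null, table_name from information_schema.tables #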

Now that we’ve identified this connection content as an attack connection, specifically an SQL attack, we will label it as such in the spreadsheet.

The data set contains 41 instances with 33 attributes, as illustrated below.

[Image: data set attack classes]

The following figure shows the newly created data set.

[Image: the final classified data set]

Now that we have a good data set with features to detect advanced attacks, we can use it to train the computer to classify new connections.

Selecting and Classifying Features

I selected nine of the most important and generic features out of 33 to train the computer to recognize the attacks:

  • Protocol;
  • Service;
  • Entropy;
  • Number of nonprintable characters;
  • Number of punctuation characters;
  • Contains JavaScript;
  • Contains SQL statement;
  • Contains command injection; and
  • Class.

For the classification, I used Weka, a collection of machine learning algorithms for data mining tasks. For testing, I used 10-fold cross-validation.
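The same experiment can be sketched outside Weka, for example with scikit-learn (an assumption about tooling; X holds the nine selected feature columns and y the class labels):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross-validation of a decision tree on the extracted features.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("mean accuracy: %.3f" % scores.mean())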

The table below shows the classification accuracy using several machine learning algorithms.

[Image: classification results using multiple learning algorithms]

Intrusion Detection in the Cognitive Era

Security analysts can use machine learning to build an effective intrusion detection capability. The trick is to select the right features to create the most effective data set with which to train the machine to distinguish between normal and malicious traffic.

This is just one of the many ways IT professionals can apply cognitive computing to cybersecurity. You can even combine machine learning with your existing IDS by importing the induced rules from the classification tree into the system.

Reposted from: https://securityintelligence.com/applying-machine-learning-to-improve-your-intrusion-detection-system/

Original author: Mutaz Alsallal

Finding Bugs in TensorFlow with LibFuzzer

Over the past year, I’ve spent some time working on improving the robustness of TensorFlow.  As I mentioned earlier, one of my goals for my time at Google was to dive into industry best-practices for writing good code.  At Google, writing good code starts with careful programmers, requires good tests that get run on a fantastic internal testing infrastructure, is improved through code review, and makes use of several code quality tools and linters.

One part of that testing that’s been gaining more visibility recently is fuzzing – throwing random inputs at programs or libraries to try to cause them to crash.  John Regehr has been fuzzing compilers for a while now – very effectively.  (That link has a nice taxonomy of the types of fuzzers.)  Google’s Project Zero has been fuzzing the FreeType library for the last 4 years, and has found a tremendous number of security vulnerabilities in all sorts of programs.   (This isn’t to suggest fuzzing is new – it’s been used for years by security researchers, and was invented in 1988 by Barton Miller.  Back in 1997, amusingly, before I knew what the term fuzzing was, a much younger me used a random input generator to find a bug in the terminal servers I was deploying at my ISP.  I remember being very pleased with myself at the time — probably excessively so.)

Modern, production fuzzers such as AFL and libFuzzer aren’t purely random, they can be guided:  They start with an input “corpus” of examples, and then mutate them.  They use compiler and binary rewriting support to guide the exploration of mutations to maximize code coverage.  The combination of a good starting corpus and coverage-guided mutation makes for impressive results.

Midway through my year at Google, Chris Olah, Dario Amodei, and Dan Mané started writing their Concrete Problems in AI Safety paper. While many of the problems and solutions they discuss are in the realm of machine learning, I was interested in the question of the systems side, provoked in part by their example of a machine intelligence-based system exploiting a buffer overflow in its reward function (by accident, but producing undesirable results). A malfunctioning reward function isn't the kind of thing that traditional isolation approaches such as sandboxing can prevent – it manifests as a logic bug, not an escape-into-the-rest-of-the-system. Not that it's a complete solution, but it made me curious whether there was value in trying to fuzz TensorFlow. And so, encouraged by Kostya Serebryany, I decided to write some adapters from libFuzzer to some of the TensorFlow kernels.

(For those who note that some of these bugs wouldn't have existed in more strongly-typed or safe languages, you're probably right. But high-performance machine learning software faces very strong demands from performance, correctness, and flexibility perspectives, and I'm very sympathetic to the challenge of achieving all three of these goals simultaneously. I'm hopeful that the new XLA compiler framework might make it easier to achieve them, but since it's still a work-in-progress, that remains to be seen!)

How to Fuzz

All libFuzzer tests start by writing a single function that tests a library call or calls, like this:

// fuzz_target.cc
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  DoSomethingInterestingWithMyAPI(Data, Size);
  return 0;  // Non-zero return values are reserved for future use.
}

When the compiled fuzzing binary is executed, it repeatedly invokes this function with “random” (mutated) data supplied in Data.  It’s your job as the fuzzing adapter writer to figure out how to map that blob of random data to calls to your application.

When fuzzing, the binary is typically compiled with LLVM’s address sanitizer, which detects several common memory errors, such as out-of-bounds array accesses, and turns them into a crash.  The libFuzzer driver detects that crash and saves the example that caused it.  An output might look something like this:

dga@instance-1:~/fuzz$ ./fuzz 
INFO: Seed: 44456222
INFO: Loaded 0 modules (0 guards): 
INFO: -max_len is not provided, using 64
INFO: A corpus is not provided, starting from an empty corpus
#0 READ units: 1
#1 INITED cov: 8 units: 1 exec/s: 0
#2 NEW    cov: 9 units: 2 exec/s: 0 L: 64 MS: 0 
=================================================================
==1310==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff56ccc6e8 at pc 0x0000004f69e1 bp 0x7fff56ccc690 sp 0x7fff56ccc688
WRITE of size 4 at 0x7fff56ccc6e8 thread T0

This leaves behind a file that can be used to reproduce the crash. After you fix the bug, you would typically add this file to the fuzz seed set to prevent regressions.

What to Fuzz in TensorFlow?

Much of what TensorFlow does is numeric, and finding bugs in those operations is important – it prevents massive headaches when trying to debug why a model isn’t working, for example.  But testing the numerical results requires a reference or spec that’s (believed to be) correct.  Some of the tests for the XLA compiler for TensorFlow do this, for example, by using the CPU version as a reference and ensuring that the XLA version produces the same answer.  This is an important set of tests for ensuring the correctness of XLA, but when I started writing my version, I didn’t have a reference, and didn’t want to try to write a spec for every operation.

Instead, I focused on the thing fuzzing tends to do best, which is finding crashes. I started with the lowest-hanging fruit, and the place I thought bugs were most likely to linger: complex input parsers. TensorFlow includes functions to encode and decode several image formats, such as tf.image.decode_png.

Starting with the image decoders made things easy – you can find existing corpora of example inputs to start with, and the functions effectively take only a single input, the string to decode into a Tensor.

Many bugs were found and fixed.

This approach found bugs in the PNG decoder and the JPEG decoder.  Extra fuzzing done using automatic fuzzing infrastructure then turned up more, which Brennan Saeta kindly fixed.

After bashing against the image decoders, I turned to some of the string parsing functions, again finding a subtle bug in the strtonum function.  And then again in the TensorProto parser.

Interestingly, fuzzing didn't just find the expected buffer overflows or other failures; it also pointed out places where the code was unnecessarily fragile in its error handling.

A general design principle in TensorFlow is that errors in a kernel should be returned to the caller in a friendly way, so that they can deal with it appropriately.  A common pattern for handling this is to write code like this, in which immediately after entering the kernel, the programmer writes checks for as many error conditions as can be caught up-front:

explicit AsStringOp(OpKernelConstruction* ctx) : OpKernel(ctx) {
  OP_REQUIRES_OK(ctx, ctx->GetAttr("T", &dtype));
  OP_REQUIRES_OK(ctx, ctx->GetAttr("precision", &precision));

  // ... do the work here after making sure the above worked ...

This is good design for any library, since it doesn’t impose your idea of failure handling on your users.  By throwing the fuzzer at TensorFlow, I was able to find a few places where errors resulted in program termination instead of having a friendly return.

I’ve Become a Fan of Fuzzing

Before this, I’d primarily thought of fuzzing as something you did to find security holes in other people’s code.   Now I’ve changed my tune, and I’m convinced that using a modern fuzzer is a great part of normal software development and testing.  They won’t find all of your bugs, and you have to write the adapters to let them get inside your functions, but they’ll teach you new things about your code and help you find problems before they manifest in the real world.  I’m sold.

If you’re interested in trying it out on your own code, a good next step would be the libFuzzer tutorial.

Happy bug squishing!

(For more of my posts, see the archive.)

Reposted from: https://da-data.blogspot.com/2017/01/finding-bugs-in-tensorflow-with.html

Original author: David Andersen

A Report on Adversarial Attacks against Machine Learning

Starting on December 29, a mysterious account named Master challenged top Go players one after another on online Go platforms such as Yicheng (弈城) and Yehu (野狐), racking up an astonishing winning streak. On January 4, masters including Nie Weiping, Chang Hao and Zhou Ruiyang lost to Master in succession; by now it has won 60 games in a row. Before its match against Gu Li, Master finally revealed its identity: it was indeed an upgraded version of AlphaGo, which caused such a stir last year, and it announced in advance that the game against Gu Li would be its last.

We might treat Master's 60-game streak as both a signal and a warning about the contest between artificial intelligence and humans: how humanity "upgrades" itself in the AI era is worth everyone's thought. At the same time, the security problems AI brings urgently need to be tackled by security experts.

As technology advances, artificial intelligence is moving ever closer to people's daily lives, and the security issues hidden within it are gradually drawing the attention of top security experts.

This article, written by experts from Baidu Security Lab, describes in detail how top security researchers constructed attack data against today's popular AI scenarios of image object recognition and speech recognition, and the attack demos they gave, at the GeekPwn 2016 Silicon Valley session.

The Baidu Security Lab experts offer their own insights into these attack techniques and predictions about future trends.

As artificial intelligence and machine learning are applied across every area of the internet, how they can be attacked, and how well they hold up against attack, has long been a concern of the security community. Earlier discussions of attacks on machine learning models were usually limited to poisoning the training data; because models tend to be deployed in a closed fashion, that approach is rarely practical in the real world. At the GeekPwn 2016 Silicon Valley session, top security researchers from North American industry and academia showed, for today's popular image object recognition and speech recognition scenarios, how to construct adversarial attack data: the difference from the source data is either too subtle for humans to perceive, or does not change the content as humans perceive it, yet the machine learning model accepts the input and makes a wrong classification decision. They also gave live attack demos. The researchers' attack techniques are described in detail below.

1. Attacking image and speech recognition systems

Artificial intelligence and machine learning are now widely used in human-computer interaction, recommender systems, security protection and other areas, in concrete scenarios such as speech and image recognition, credit scoring, fraud prevention, spam filtering, and defending against malicious code and network attacks. Attackers, in turn, try to bypass machine learning models or attack them directly for adversarial ends. Human-computer interaction is particularly exposed: speech and images have emerged as new input modalities whose convenience and practicality are widely welcomed, and the spread of mobile devices that integrate them has put the technology into most people's hands. The accuracy of speech and image recognition is critical to whether the machine understands and executes the user's commands correctly.

At the same time, this link is also the easiest for an attacker to exploit: by making subtle modifications to the data source, an attacker can remain imperceptible to the user while the machine accepts the data and takes wrong follow-up actions, which can lead to compromised devices, wrongly executed commands, and serious consequences from the resulting chain reactions. Around this specific scenario, this article first briefly introduces the white-box and black-box attack models, and then, drawing on the researchers' results, describes the attack scenarios, the techniques for constructing adversarial data, and the attack effects.

1.1 Attack models

Unlike other attacks, an adversarial attack mainly happens while the adversarial data is being constructed; the adversarial data is then fed into the machine learning model just like normal data and produces a deceptive recognition result. Whether the target is an image recognition system or a speech recognition system, the construction of adversarial data falls into one of two cases, depending on how much the attacker knows about the machine learning model:

· White-box attack

The attacker knows the algorithm the machine learning system uses and the parameters of that algorithm, and can interact with the system while generating the adversarial data.

· Black-box attack

The attacker does not know the algorithm or parameters the machine learning system uses, but can still interact with the system, for example by submitting arbitrary inputs and observing the outputs.

2. Adversarial attacks on machine learning demonstrated at GeekPwn

2.1 Physical Adversarial Examples

At the GeekPwn 2016 Silicon Valley session, Ian Goodfellow of OpenAI and Alexey Kurakin of Google Brain presented how "adversarial images" fool machine learning in the real, physical world. Notably, Ian Goodfellow is the inventor of generative adversarial networks.

First, a brief introduction to adversarial image attacks. In an adversarial image attack, the attacker constructs an image that the human eye and an image recognition system classify differently. For example, against a self-driving car that relies on image recognition, an attacker could craft a picture that looks like a stop sign to a human but looks like a speed-limit-60 sign to the car.

Figure 1: Attacking an image recognition scenario

At the event, Ian and Alexey pointed out that earlier work on adversarial images assumed an attack model in which the attacker feeds data directly into the machine learning model, and can therefore modify the image at arbitrary granularity without worrying about lighting, camera angle, or the changes introduced when a device reads the picture. They therefore examined how adversarial images perform in the real physical world, where, before reaching the machine learning model, the image also goes through a series of uncontrolled transformations such as printing, the surrounding environment and camera processing. Compared with handing the computer a lossless image file directly, this attack is far more realistic.

To construct the adversarial images they used the fast gradient sign (FGS) method and its iterative variant for untargeted attacks, and an iterative FGS method for targeted attacks [1]. In an untargeted attack, the attacker only wants the adversarial image to be classified differently from the original and does not care what the resulting label is; in a targeted attack, the attacker fixes in advance the label the target model should output for the constructed image.

In the targeted attack, the authors first use the model's conditional probabilities to find, for a given source image, the least likely class, denoted y_LL (this class is usually completely different from the original one), and then apply the targeted iterative FGS method to produce the adversarial image. Untargeted attacks work fairly well on datasets with few, widely separated classes, but once the classes are closely related, the adversarial image is very likely to drift only within the same broad category; in that case the targeted attack is much more effective.
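For reference, the least-likely-class iterative method of [1] can be written as:

y_{LL} = \arg\min_{y} p(y \mid X), \qquad X^{adv}_{0} = X,

X^{adv}_{N+1} = \mathrm{Clip}_{X,\epsilon}\left\{ X^{adv}_{N} - \alpha \,\mathrm{sign}\!\left( \nabla_{X} J\!\left(X^{adv}_{N},\, y_{LL}\right) \right) \right\}

where J is the classifier's cross-entropy loss, \alpha is the per-step perturbation magnitude, and \mathrm{Clip}_{X,\epsilon} keeps every pixel of the result within \epsilon of the original image X.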

Figure 2: How adversarial images fool machine learning in the physical world

To validate the results, the authors used a white-box attack model, with Google's Inception v3 as the target image recognition model, and constructed adversarial counterparts to 50,000 ImageNet validation images against Inception v3. In the experiment they printed out all the adversarial and original images, photographed them by hand with a Nexus 5 smartphone, and then fed the photos from the phone into the Inception v3 model. The results showed that 87% of the adversarial images still fooled the machine after passing through this external transformation, demonstrating that physical adversarial examples are feasible in the real world.

In their paper the authors also measured how much the physical-world image transformations degrade adversarial images built with the different methods. An interesting finding is that the iterative methods suffer more from these transformations: they make subtler adjustments to the original image, and those subtle adjustments are more easily destroyed along the way. The authors also measured, separately, how brightness, contrast, Gaussian blur, Gaussian noise and JPEG encoding degrade each class of adversarial images; see their paper for the detailed results [1].

2.2 Exploring New Attack Space on Adversarial Deep Learning

Professor Dawn Song and Dr. Chang Liu (刘畅) of UC Berkeley presented attacks and defenses for adversarial deep learning in further domains. Dawn Song is one of the main contributors to taint analysis and a recipient of the MacArthur Fellowship ("genius grant"). The speakers first extended adversarial deep learning to image captioning and detection, and then proposed an optimization for constructing adversarial images: the ensemble black-box attack algorithm [6].

In object detection and captioning, as shown on the left of Figure 3, deep learning can detect the different objects in an image and the relationships between them, and automatically generate a caption [2]. In this setting an adversarial image attack can likewise fool the machine learning model into emitting an anomalous caption, as shown on the right of Figure 3. The basic idea for constructing the adversarial image is to fix a prefix of the caption and then mislead the judgment that follows it.

Figure 3: Adversarial images applied to image recognition and detection

The speakers also studied how adversarial image attacks behave against black-box classification models and proposed an optimized algorithm, the ensemble black-box attack. In the usual case the attacker does not know what algorithm the target model uses or its parameters, so only a black-box attack is possible. The process is as follows:

1. Without knowing the target machine learning model, the attacker queries the black-box system and collects a set of training samples from its responses.

2. The attacker picks some machine learning algorithm and trains a known model of their own on those samples.

3. The attacker constructs adversarial data against the known model they have trained.

Figure 4: Black-box adversarial image attack workflow

This attack relies on the transferability of adversarial images: a large fraction of adversarial images constructed against machine learning model A will also fool machine learning model B. Table 1 shows, for untargeted adversarial images built against different source models with single-network optimization, how well they deceive different target models. Cell (i, j) is the percentage of adversarial images generated against model i that model j still recognizes as the original class. When the same image recognition system is used both to construct and to evaluate the adversarial images (the white-box case), the percentage is 0, meaning the construction works perfectly: none of the images are recognized correctly. When the evaluation model differs from the construction model, the percentages mostly fall between 10% and 40%, which is solid evidence that adversarial data transfers to some extent across algorithms.

Table 1: Effectiveness on target models of untargeted adversarial attacks constructed against different source machine learning models (single-network optimization). ResNet-50, ResNet-101, ResNet-152, GoogLeNet, Inception-v3 and VGG-16 are popular deep neural network image recognition systems.

The authors then ran the same kind of experiment for targeted adversarial attacks. The results show that targeted labels transfer much more poorly: no more than 4% of the adversarial images are recognized with the same targeted label by both the source and the target machine learning model.

Based on this, the authors proposed the ensemble method: the adversarial image is constructed on top of several deep neural network models at once, that is, the single known model in Figure 4 is replaced by several different known models that jointly produce one adversarial image.
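A simplified sketch of that joint optimization, assuming TensorFlow, a list models of differentiable classifiers standing in for the known networks, an input image, and a chosen target_label; the per-model weights described in the paper are collapsed here to an unweighted sum:

import tensorflow as tf

delta = tf.Variable(tf.zeros_like(image))      # shared perturbation to optimize
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

for step in range(100):                        # 100 iterations, as in the text below
    with tf.GradientTape() as tape:
        adv = tf.clip_by_value(image + delta, 0.0, 1.0)
        # Sum the targeted loss over every model in the known ensemble
        # (models are assumed to output logits).
        loss = tf.add_n([loss_fn(target_label, m(adv)) for m in models])
    opt.apply_gradients([(tape.gradient(loss, delta), delta)])

adversarial_image = tf.clip_by_value(image + delta, 0.0, 1.0)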

In the experiments, the authors mounted a black-box attack against each of five different deep neural network models in turn. When attacking each model they assumed the remaining four were known and constructed the adversarial images in white-box fashion against that ensemble. As before, they built the adversarial images both with the optimization-based attack and with the fast-gradient-based attack, again using the Adam optimizer; after 100 iterations of updates to the weight vector the loss function converges. The authors found that many of the attacker's preset target labels transferred as well. Detailed results are given in Table 2, where cell (i, j) is the accuracy of the targeted label when adversarial images generated with the four algorithms other than model i are evaluated on model j. When the target model is included in the set of known models, the transferability of the targeted label is above 60% in every case; even when the target model is not in the known set, the targeted-label accuracy is still above 30%.

Table 2: Effectiveness on target models of targeted adversarial attacks constructed against different source machine learning models (ensemble method).

The authors also ran untargeted attacks with the ensemble algorithm; the results are shown in Table 3. Compared with Table 1, the ensemble algorithm is markedly more deceptive.

Table 3: Effectiveness on target models of untargeted adversarial attacks constructed against different source machine learning models (ensemble method).

2.3 Hidden Voice Commands

Figure 5: Attacking a speech recognition scenario

Dr. Tavish Vaidya of Georgetown University presented the work on hidden voice commands.

In an adversarial audio attack, the attacker constructs a piece of audio that the human ear and the speech recognizer interpret differently. The biggest difference from image attacks is that the goal here is to make the adversarial audio as different from the original speech as possible, rather than to keep the two similar. Reflecting realistic conditions, the team proposed both a black-box and a white-box variant. In their experiments, a loudspeaker played noise that humans could not make out, yet a Samsung Galaxy S4 and an iPhone 6 correctly recognized it as the corresponding voice command, making the phones switch to airplane mode, dial 911, and so on [3].

Black-box attack (speech recognition):

In the black-box attack model the attacker does not know the machine learning algorithm; the attacker's only knowledge is that the system uses MFC. The MFC algorithm transforms the audio from a high-dimensional space into a low-dimensional one, filtering out some noise while keeping inputs the machine learning system can operate on. Some information is inevitably lost in the high-to-low transformation, and, correspondingly, some noise is added when going back from low to high dimension. The black-box attack works by iteration: the attacker keeps adjusting the MFCC parameters and applying the MFCC transform and its inverse to the audio, filtering out the information that humans need but the machine does not, until an obfuscated piece of audio is obtained. Because MFC is used in a large proportion of speech recognition systems, the attack model remains quite general. The concrete steps are shown in Figure 6 below; interested readers can consult the paper [3].
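To illustrate the MFCC round trip at the core of this attack, here is a sketch using librosa (an assumed library choice; the real attack in [3] additionally tunes the MFCC parameters over many iterations to widen the gap between human and machine recognition):

import librosa
import soundfile as sf

y, sr = librosa.load("ok_google.wav", sr=16000)             # original command
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # high -> low dimension
y_obf = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)  # low -> high again
sf.write("ok_google_obfuscated.wav", y_obf, sr)

The reconstruction keeps roughly what the recognizer's features capture while discarding much of what human listeners rely on, which is the property the obfuscated commands exploit.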

Figure 6: Black-box adversarial audio attack model [3]

In the experiments, the authors found that the speech recognition system they used could only recognize voice commands within 3.5 meters. With the loudspeaker kept 3 meters from the phone, Table 4 reports how often humans and machines recognized the different commands. On average, 85% of the normal voice commands were recognized by the speech recognizer, and 60% of the obfuscated versions were still recognized. For human recognition, the authors used the Amazon Mechanical Turk service, crowdsourcing reviewers to guess the content of the audio. The obfuscation worked to different degrees for different commands: for "OK Google" and "Turn on airplane mode", fewer than 25% of the obfuscated commands were correctly recognized by humans, whereas, unusually, 94% of the obfuscated "Call 911" commands were. The authors see two main reasons: the command is extremely familiar, and the reviewers could replay the audio repeatedly, which raised their chances of guessing it.

Table 4: Results of the black-box adversarial audio attack [3]

White-box attack (speech recognition):

In the white-box attack, the team targeted an open-source system, the CMU Sphinx speech recognition system [4]. CMU Sphinx first splits the audio into a series of overlapping frames, then applies the Mel-Frequency Cepstrum (MFC) transform to each frame to reduce the audio input to a smaller dimensional space; this is the feature extraction step in Figure 7. Sphinx then uses a Gaussian Mixture Model (GMM) to compute the probability that a given piece of audio corresponds to a given phoneme, and finally a Hidden Markov Model (HMM) turns these phoneme probabilities into the most likely text. The GMM and HMM together form the machine learning algorithm in Figure 7.

Figure 7: The CMU Sphinx speech recognition system [4]

In Tavish's white-box attack model he proposed two methods: a simple approach and an improved attack. The simple approach differs from the black-box method in that the MFCC parameters are known, so gradient descent can be used to keep, in a much more targeted way, only the values that matter for machine recognition. As gradient descent proceeds, the input frame moves ever closer to the machine's recognition target y, while the extra information humans need for recognition is inevitably stripped away.

The improved attack builds on the fact that machines and humans differ in their sensitivity to pitch variation (phonemes): by reducing the number of frames assigned to each phoneme, the audio can still be recognized by the machine while a human hears only flat, jumbled noise. The resulting feature values are then run through the inverse MFCC transform to produce the final audio that reaches people's ears. The details are closely tied to speech-processing background; interested readers can consult the paper for the specific method. Table 5 shows the results of their attack.

Table 5: Results of the white-box adversarial audio attack [3]

2.4 Defending against adversarial data

Although the discovery of adversarial data attacks is ingenious, effective defenses are not hard to build for today's image and speech recognition applications. They fall into several categories:

1. Add human interaction and confirmation; for example, the machine can simply sound an alert or request an audio CAPTCHA.

2. Make it harder for adversarial data to reach the machine learning model as input; for example, a speech recognition system can use voiceprint recognition or audio filters to screen out most malicious audio.

3. Train the machine learning model itself to distinguish benign from malicious data; the known adversarial examples then become valuable training data.

4. Penn State has also proposed a distillation-based method [5] that extracts a kind of fingerprint from the deep neural network to protect it.

As artificial intelligence reaches deeper into people's lives, people will rely ever more on the efficiency and convenience it brings. At the same time it becomes a target for attackers, making products and online services that rely on machine learning untrustworthy. What the GeekPwn 2016 Silicon Valley session revealed is top security experts' concern about the security of machine learning. As one application scenario after another is broken with ease, even if so far only in speech and image recognition, we should recognize clearly how serious the consequences of a successful attack will be once these scenarios are combined with other services. As an indispensable part of future intelligent and automated services, artificial intelligence has already become a new battleground between the security industry and malicious attackers.

Bibliography

[1] A. Kurakin, I. J. Goodfellow and S. Bengio, "Adversarial examples in the physical world," CoRR, 2016.

[2] J. Johnson, A. Karpathy and L. Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," arXiv preprint arXiv:1511.07571, 2015.

[3] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner and W. Zhou, "Hidden Voice Commands," in USENIX Security 16, Austin, 2016.

[4] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, B. Raj and P. Wolf, "Design of the CMU Sphinx-4 Decoder," in Eighth European Conference on Speech Communication and Technology, 2003.

[5] N. Papernot, P. McDaniel, X. Wu, S. Jha and A. Swami, "Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks," in IEEE Symposium on Security and Privacy, 2016.

[6] Y. Liu, X. Chen, C. Liu and D. Song, "Delving into transferable adversarial examples and black-box attacks," arXiv preprint, 2016.

* This article is reposted from the "Baidu Security Lab" (百度安全实验室) WeChat public account. Authors: 曹跃, 仲震宇, 韦韬.

Original article: http://mp.weixin.qq.com/s/QKXd9AKkVwk3CO45-BbZSA