Abstract. In this study, I evaluate some popular deep learning toolkits. The candidates are listed in alphabetical order: Caffe, CNTK, TensorFlow, Theano, and Torch. This is a dynamic document and the evaluation, to the best of my knowledge, is based on the current state of their code.
I also provide ratings in some areas because, for a lot of people, ratings are useful. However, keep in mind that ratings are inherently subjective.
If you find something wrong or inadequate, please help improve this document by filing an issue.
Modeling Capability
In this section, we evaluate each toolkit's ability to train common and state-of-the-art networks without writing too much code. Some of these networks are:
Convolutional networks: AlexNet, OxfordNet, GoogleNet
Recurrent networks: plain RNN, LSTM/GRU, bidirectional RNN
In addition, we also evaluate the flexibility to create a new type of model.
Caffe, the most popular deep learning toolkit in the community and in industry, is highly scalable, extensible, and compatible, but its support for recurrent networks is poor.
Caffe, started in late 2013, is perhaps the first mainstream industry-grade deep learning toolkit, thanks to its excellent convnet implementation (at the time). It is still the most popular toolkit within the computer vision community, with many extensions being actively added.
However, its support for recurrent networks and language modeling in general is poor, due to its legacy architecture, whose limitations are detailed in the architecture section.
CNTK is a deep learning system started by the speech researchers who started the deep learning craze, and it has grown into a more general, platform-independent deep learning system. It is better known in the speech community than in the general deep learning community.
In CNTK (as in TensorFlow and Theano), a network is specified as a symbolic graph of vector operations, such as matrix add/multiply or convolution. A layer is just a composition of those operations. The fine granularity of the building blocks (operations) allows users to invent new complex layer types without implementing them in a low-level language (as one must in Caffe).
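To make "a layer is just a composition of operations" concrete, here is a minimal sketch in plain Python. It is not the API of CNTK, TensorFlow, or Theano. The names `placeholder`, `matmul`, `add`, `relu`, `evaluate`, and `dense_relu_layer` are hypothetical. Building the graph computes nothing; values only appear when `evaluate` is given concrete inputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = List[float]


@dataclass
class Node:
    """A node in a toy symbolic graph: an input placeholder or a primitive op."""
    op: str
    inputs: Tuple["Node", ...] = ()
    name: str = ""


# Primitive vector operations: each call only builds a graph node, nothing is computed yet.
def placeholder(name: str) -> Node:
    return Node("input", (), name)

def matmul(w: Node, x: Node) -> Node:
    return Node("matmul", (w, x))

def add(a: Node, b: Node) -> Node:
    return Node("add", (a, b))

def relu(a: Node) -> Node:
    return Node("relu", (a,))


def evaluate(node: Node, feed: Dict[str, list]) -> Vector:
    """Recursively compute a node's value, given concrete values for the placeholders."""
    if node.op == "input":
        return feed[node.name]
    args = [evaluate(child, feed) for child in node.inputs]
    if node.op == "matmul":
        w, x = args
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    if node.op == "add":
        a, b = args
        return [ai + bi for ai, bi in zip(a, b)]
    if node.op == "relu":
        return [max(0.0, v) for v in args[0]]
    raise ValueError(f"unknown op: {node.op}")


def dense_relu_layer(x: Node, w: Node, b: Node) -> Node:
    """A 'layer' is just a composition of primitive ops: relu(W x + b)."""
    return relu(add(matmul(w, x), b))


x, w, b = placeholder("x"), placeholder("W"), placeholder("b")
y = dense_relu_layer(x, w, b)
print(evaluate(y, {"x": [1.0, 2.0], "W": [[1.0, -1.0], [0.5, 0.5]], "b": [0.0, 1.0]}))  # [0.0, 2.5]
```

A new layer type here needs no new primitive or gradient code. It is simply another Python function that wires existing ops together.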
As of today, CNTK is not usable for a variety of tasks, such as sequence-to-sequence learning.
TensorFlow is a relatively new toolkit. RNNs can be expressed easily and efficiently in it (using the bucketing trick). Notes: an RNN API, with a suboptimal implementation; bidirectional RNNs; no 3D convolution for video yet.
New models. Since TF uses the symbolic-graph-of-vector-operations approach, specifying a new network is fairly easy. Although it doesn't support symbolic loops yet (at least not well tested or documented, as of 05/2016), RNNs can be made easy and efficient using the bucketing trick.
However, TF has a major weakness in terms of modeling flexibility. Every computational flow has to be constructed as a static graph. That makes some computations difficult, such as beam search (which is used frequently in sequence prediction tasks).
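The bucketing trick works around the static-graph limitation. Instead of one unrolled graph per sequence length, you build one graph per bucket length and pad each sequence up to the smallest bucket that fits it. Below is a minimal plain-Python sketch of just the grouping-and-padding step. The function `bucket_sequences` and the padding value `0` are illustrative assumptions, not TensorFlow API.

```python
from typing import Dict, List, Sequence


def bucket_sequences(
    seqs: Sequence[Sequence[int]], bucket_sizes: Sequence[int], pad: int = 0
) -> Dict[int, List[List[int]]]:
    """Assign each sequence to the smallest bucket that fits it and right-pad it.

    With a static graph, one unrolled graph is built per bucket size, so the
    number of graphs stays small while padding waste stays bounded.
    """
    sizes = sorted(bucket_sizes)
    buckets: Dict[int, List[List[int]]] = {size: [] for size in sizes}
    for seq in seqs:
        for size in sizes:
            if len(seq) <= size:
                buckets[size].append(list(seq) + [pad] * (size - len(seq)))
                break
        else:  # no bucket was large enough
            raise ValueError(f"sequence of length {len(seq)} exceeds the largest bucket ({sizes[-1]})")
    return buckets


print(bucket_sequences([[1, 2], [3, 4, 5], [6, 7, 8, 9, 10]], bucket_sizes=[3, 5]))
# {3: [[1, 2, 0], [3, 4, 5]], 5: [[6, 7, 8, 9, 10]]}
```

The bucket sizes trade off against each other. More buckets mean less padding but more graphs to build and keep in memory.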
New models. Theano pioneered the trend of using a symbolic graph for programming a network. Theano's symbolic API supports looping control, the so-called scan, which makes implementing RNNs easy and efficient. Users don't always have to define a new model at the tensor-operation level: there are a few higher-level frameworks, mentioned above, that make model definition and training simpler.
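The following sketch shows what a scan-style loop computes for an RNN: it threads a hidden state through the sequence and returns every intermediate state. It is written as ordinary Python and mimics the semantics only. It does not reproduce the signature of Theano's `scan`. `rnn_step` is a hypothetical scalar RNN cell with made-up weights.

```python
import math
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")
H = TypeVar("H")


def scan(step: Callable[[X, H], H], sequence: Sequence[X], initial: H) -> List[H]:
    """Apply h_t = step(x_t, h_{t-1}) along the sequence and collect every h_t.

    This mimics what a symbolic loop computes; it is not Theano's signature.
    """
    states: List[H] = []
    h = initial
    for x in sequence:
        h = step(x, h)
        states.append(h)
    return states


def rnn_step(x: float, h: float, w_x: float = 0.5, w_h: float = 0.8) -> float:
    """Plain scalar RNN cell with made-up weights: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    return math.tanh(w_x * x + w_h * h)


hidden_states = scan(rnn_step, [1.0, 0.0, -1.0], initial=0.0)
print(len(hidden_states))  # 3, one hidden state per time step
```

In a symbolic toolkit, the loop itself becomes part of the graph. The same graph therefore handles sequences of any length without unrolling or bucketing.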
Temporal convolution can be done in other toolkits via conv2d, but that's a trick. The native interface for temporal convolution in Torch makes it slightly more intuitive to use.
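The conv2d trick treats a T x D sequence as a one-channel image. It then uses a kernel whose width spans all D features, so the 2D output collapses to a single column. The plain-Python sketch below checks that this matches a direct temporal convolution. Both use cross-correlation with "valid" padding, as deep learning toolkits do. All function names are illustrative, not any toolkit's API.

```python
from typing import List

Matrix = List[List[float]]


def conv1d_temporal(x: Matrix, w: Matrix) -> List[float]:
    """Temporal convolution ('valid', one output channel, cross-correlation as in DL toolkits).

    x is T x D (time steps x features); w is K x D. Returns T - K + 1 values.
    """
    t_len, k_len, d_len = len(x), len(w), len(w[0])
    return [
        sum(w[k][d] * x[t + k][d] for k in range(k_len) for d in range(d_len))
        for t in range(t_len - k_len + 1)
    ]


def conv2d_valid(img: Matrix, ker: Matrix) -> Matrix:
    """Plain 2D cross-correlation with 'valid' padding."""
    h, w = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    return [
        [sum(ker[i][j] * img[r + i][c + j] for i in range(kh) for j in range(kw))
         for c in range(w - kw + 1)]
        for r in range(h - kh + 1)
    ]


def temporal_conv_via_conv2d(x: Matrix, w: Matrix) -> List[float]:
    """The trick: view the T x D sequence as an image and use a kernel spanning
    all D features, so the 2D output is (T - K + 1) x 1."""
    return [row[0] for row in conv2d_valid(x, w)]


seq = [[1, 2], [3, 4], [5, 6], [7, 8]]   # T = 4 steps, D = 2 features
kernel = [[1, 0], [0, 1]]                 # K = 2
print(conv1d_temporal(seq, kernel), temporal_conv_via_conv2d(seq, kernel))  # [5, 9, 13] [5, 9, 13]
```

The results agree, but the 2D version forces you to reason about image axes when you really mean time and features. That is the awkwardness a native temporal-convolution interface avoids.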
New models. In Torch, there are multiple ways (stack of layers or graph of layers) to define a network, but essentially a network is defined as a graph of layers. Because of this coarser granularity, Torch is sometimes considered less flexible: for new layer types, users have to implement the full forward, backward, and gradient input update.
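Here is a rough sketch of what "implement the full forward, backward, and gradient update" involves for a layer-graph toolkit, written in Python rather than Lua. In Torch's nn package these duties map roughly to `updateOutput`, `updateGradInput`, and `accGradParameters`. The `Scale` layer below is a made-up example and does not mirror Torch's actual module code.

```python
from typing import List


class Scale:
    """Element-wise y = w * x + b with scalar parameters w and b (a made-up example layer)."""

    def __init__(self, w: float = 1.0, b: float = 0.0) -> None:
        self.w, self.b = w, b
        self.grad_w, self.grad_b = 0.0, 0.0
        self._input: List[float] = []

    def forward(self, x: List[float]) -> List[float]:
        self._input = list(x)  # cache the input; the backward pass needs it
        return [self.w * xi + self.b for xi in x]

    def backward(self, grad_output: List[float]) -> List[float]:
        """Given dL/dy, accumulate dL/dw and dL/db and return dL/dx."""
        self.grad_w += sum(g * xi for g, xi in zip(grad_output, self._input))
        self.grad_b += sum(grad_output)
        return [self.w * g for g in grad_output]


layer = Scale(w=2.0, b=1.0)
y = layer.forward([1.0, -3.0])        # [3.0, -5.0]
grad_x = layer.backward([1.0, 1.0])   # dL/dy = 1 for L = sum(y), so dL/dx = [2.0, 2.0]
print(y, grad_x, layer.grad_w, layer.grad_b)  # [3.0, -5.0] [2.0, 2.0] -2.0 2.0
```

Compared with the op-graph style, the author of the layer writes the gradient code by hand. That hand-written gradient code is the source of the "coarser granularity" cost.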
However, unlike in Caffe, defining a new layer in Torch is much easier because you don't have to program in C++. Plus, in Torch, the difference between defining a new layer and defining a network is minimal. In Caffe, layers are defined in C++ while networks are defined via Protobuf.
Torch is more flexible than TensorFlow and Theano in that it is imperative, while TF/Theano are declarative (i.e. one has to declare a computational graph in a symbolic graph language). That makes some operations, e.g. beam search, much easier to do in Torch.
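Beam search is easy to write imperatively because its control flow depends on runtime values. The number of live hypotheses shrinks as sequences finish, and the loop may stop early. The sketch below is plain Python. `beam_search`, `toy_model`, and the probabilities in `TABLE` are all made up for illustration. With a beam of 2, it finds a sequence that greedy decoding (beam width 1) misses.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (token sequence, total log-probability)


def beam_search(
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    start: str,
    end: str,
    beam_width: int,
    max_len: int,
) -> Hypothesis:
    """Keep the `beam_width` best partial sequences at each step.

    How many hypotheses stay alive, and how many iterations run, depends on the
    values computed so far: trivial imperatively, awkward in a static graph.
    """
    beams: List[Hypothesis] = [([start], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in next_log_probs(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((seq, score))  # completed hypothesis
            else:
                beams.append((seq, score))     # still being extended
        if not beams:  # data-dependent early exit
            break
    finished.extend(beams)  # hypotheses cut off by max_len also compete
    return max(finished, key=lambda c: c[1])


# Toy "model" with made-up next-token probabilities, keyed by the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}

def toy_model(prefix: List[str]) -> Dict[str, float]:
    probs = TABLE.get(prefix[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

print(beam_search(toy_model, "<s>", "</s>", beam_width=2, max_len=5)[0])  # ['<s>', 'b', 'z', '</s>']
print(beam_search(toy_model, "<s>", "</s>", beam_width=1, max_len=5)[0])  # ['<s>', 'a', 'x', '</s>']
```

Expressing the same logic as a static graph means encoding the variable-size beam, the sorting, and the early exit with symbolic control-flow ops. That encoding is what makes it awkward in TF/Theano.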
Left: graph model of CNTK/Theano/TensorFlow; Right: graph model of Caffe/Torch
Caffe has a pycaffe interface, but that's a mere secondary alternative to the command line interface. The model has to be defined in protobuf (usually with a plain text editor), even if you use pycaffe.
The way to use CNTK, similar to Caffe, is to specify a config file and run the command line. CNTK is slightly worse than Caffe here because there's no Python or any other high-level language interface.
TF supports two interfaces: Python and C++. This means that you can do experiments in a rich, high-level environment and deploy your model in an environment that requires native code or low latency.
It would be perfect if TF supported TypeScript. The lack of static types in Python is just ... painful :).
Torch runs on LuaJIT, which is amazingly fast (comparable with industrial languages such as C++/C#/Java). Hence developers don't have to think about symbolic programming, which can be limiting. They can just write all kinds of computations without worrying about a performance penalty.
However, let's face it, Lua is not yet a mainstream language.
How easy is it to deploy a new model?
Caffe is C++ based and can be compiled on a variety of devices. It is cross-platform (a Windows port is available and maintained here). This makes Caffe the best choice with respect to deployment.
Like Caffe, CNTK is also C++ based and cross-platform. Hence, deployment should be easy in most cases. However, to my understanding, it doesn't work on the ARM architecture, which limits its capability on mobile devices.
TF supports a C++ interface, and the library can be compiled/optimized on ARM architectures because it uses Eigen (instead of a BLAS library). This means that you can deploy your trained models on a variety of devices (servers or mobile devices) without having to implement a separate model decoder or load a Python/LuaJIT interpreter.
TF doesn't work on Windows yet, though, so TF models can't be deployed on Windows devices.
The lack of a low-level interface and the inefficiency of the Python interpreter make Theano less attractive for industrial users. For a large model, the overhead of Python isn't too bad, but the dogma is still there.
Theano's cross-platform nature (mentioned below) enables a Theano model to be deployed in a Windows environment, which earns it some points.
Torch requires LuaJIT to run models. This makes it less attractive than the bare-bones C++ support of Caffe/CNTK/TF. It's not just the performance overhead, which is minimal. The bigger problem is integration, at the API level, with a larger production pipeline.
All of these toolkits call cuDNN, so as long as there are no major computations or memory allocations at the outer level, they should perform similarly.
Soumith@FB has done some benchmarking for ConvNets. Deep learning is not just about feedforward convnets, not just about ImageNet, and certainly not just about a few passes over the network. However, Soumith's benchmark is the only notable one as of today, so we base the single-GPU performance rating on it.
TensorFlow used to be slow when it first came out, but as of 05/2016 it has reached the ballpark of other frameworks in terms of ConvNet speed. This is not surprising, because every framework nowadays calls cuDNN for the actual computations.
Here's my latest micro benchmark of TensorFlow 0.8 vs. before. The measurement is latency, in milliseconds, for one full minibatch forward-backward pass on a single Titan X GPU.
| Network | TF 0.6 [ref] | TF 0.8 [my run] | Torch FP32 [my run] |
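The latency numbers above are per full forward-backward minibatch pass. As a rough illustration of how such a number can be collected, here is a small timing harness. It is an assumption, not the procedure behind the table. It discards warm-up calls and reports the median of several timed calls. `measure_latency_ms` and `dummy_step` are hypothetical names. On a real GPU, kernel launches are asynchronous, so `step` would also need to synchronize with the device before returning, or the clock would stop too early.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(step: Callable[[], None], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock latency of one call to `step`, in milliseconds.

    `step` stands in for one full minibatch forward-backward pass. Warm-up calls
    are run first and discarded so that one-time costs (graph construction,
    compilation, memory allocation) do not skew the numbers.
    """
    for _ in range(warmup):
        step()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_step() -> None:
    """Placeholder CPU workload; a real run would execute the network here."""
    sum(i * i for i in range(100_000))

print(f"{measure_latency_ms(dummy_step):.2f} ms")
```

The median is used here because it is less sensitive to occasional stalls than the mean. Whether the original runs used a median or a mean is not stated.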
On big networks, Theano’s performance is on par with Torch7, according to this benchmark. The main issue of Theano is startup time, which is terrible, because Theano has to compile C/CUDA code to binary. We don’t always train big models. In fact, DL researchers often spend more time debugging than training big models. TensorFlow doesn’t have this problem. It simply maps the symbolic tensor operations to the already-compiled corresponding function calls.
import theano takes time because this import apparently does a lot of stuff. Also, after import theano, you are stuck with a pre-configured device (e.g. …)
Caffe's architecture was considered excellent when it was born, but by modern standards it is average. The main pain points of Caffe are its layer-wise design in C++ and the protobuf interface for model definition.
Layer-wise design. The building block of a network in Caffe is the layer.
Protobuf. Caffe has a pycaffe interface, but that's a mere replacement for the command line interface. The model has to be defined in protobuf (usually with a plain text editor), even if you use pycaffe.
[Copied from my own answer on Quora]
To be updated ...
TF has a clean, modular architecture with multiple frontends and execution platforms. Details are in the white paper.
Theano's architecture is fairly hacky: the whole code base is Python, with C/CUDA code packaged as Python strings. This makes it hard to navigate, debug, and refactor, and hence hard to contribute to as a developer.
Torch7 and nn libraries are also well-designed with clean, modular interfaces.
Caffe, CNTK, and Theano work on all OSes. TensorFlow and Torch do not work on Windows and there's no known plan to port from either camp.
For individual experimenters, Caffe is strongly recommended for CNNs; for researchers working with RNNs, Theano and TensorFlow are recommended.
 Note that I don’t aggregate ratings because different users/developers have different priorities.
 Disclaimer: I haven’t analyzed this extension carefully.
 See my blog post for why this is desirable.