VGSL Specs - rapid prototyping of mixed conv/LSTM networks for images

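Variable-size Graph Specification Language (VGSL) lets you specify a neural network, composed of convolutions and LSTMs, that can process variable-sized images, from a very short definition string.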

Applications: What is VGSL Specs good for?

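VGSL Specs are designed specifically to create networks with the following properties:

Variable-size images as the input. (In one or BOTH dimensions!)
Output is an image (heat map), a sequence (such as text), or a category.
Convolutions and LSTMs are the main computing components.
Fixed-size images are OK too!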

Model string input and output

A neural network model is described by a string that gives the input spec, the output spec, and the layers spec in between. Example:

[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

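The first 4 numbers specify the size and type of the input, and follow the TensorFlow convention for an image tensor: [batch, height, width, depth]. Batch is currently ignored, but may eventually be used to indicate a training mini-batch size. Height and/or width may be zero, which makes them variable. A non-zero value for height and/or width means that all input images are expected to be of that size, and will be bent to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special case, a different value of depth combined with a height of 1 causes the image to be treated from input as a sequence of vertical pixel strips.

**NOTE THAT THROUGHOUT, x and y are REVERSED from conventional mathematics,** to use the same convention as TensorFlow. TF adopts this convention to eliminate the need to transpose images on input: adjacent memory locations in images increase x and then y, while adjacent memory locations in tensors in TF, and in NetworkIO in tesseract, increase the rightmost index first, then the next one to the left, and so on, like C arrays.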
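The input part of the string can be read mechanically. The following is a minimal Python sketch, not Tesseract's own parser; the function name `parse_input_spec` and the returned keys are hypothetical. It also assumes that "a different value of depth" means any depth other than 1 or 3.

```python
def parse_input_spec(spec):
    """Read the leading 'batch,height,width,depth' word of a VGSL string."""
    body = spec.strip().lstrip("[")
    first_word = body.split()[0]
    batch, height, width, depth = (int(v) for v in first_word.split(","))
    return {
        "batch": batch,    # currently ignored, per the text above
        "height": height,  # 0 means variable
        "width": width,    # 0 means variable
        "depth": depth,    # 1 = greyscale, 3 = color
        "variable_height": height == 0,
        "variable_width": width == 0,
        # Assumption: the special case applies when depth is neither 1 nor 3
        # and the height is 1 (input treated as vertical pixel strips).
        "strip_sequence": height == 1 and depth not in (1, 3),
    }


if __name__ == "__main__":
    print(parse_input_spec("[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"))
    print(parse_input_spec("[1,1,0,48 Lbx256 O1c105]"))
```

For the example string above, this reports a color input that is variable in both height and width.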

The last "word" is the output specification and takes the form:

O(2|1|0)(l|s|c)n output layer with n classes.
  2 (heatmap) Output is a 2-d vector map of the input (possibly at
    different scale). (Not yet supported.)
  1 (sequence) Output is a 1-d sequence of vector values.
  0 (category) Output is a 0-d single vector value.
  l uses a logistic non-linearity on the output, allowing multiple
    hot elements in any output vector value. (Not yet supported.)
  s uses a softmax non-linearity, with one-hot output in each value.
  c uses a softmax with CTC. Can only be used with 1 (sequence).
  NOTE Only O1s and O1c are currently supported.

The number of classes is ignored (it is only there for compatibility with TensorFlow), as the actual number is taken from the unicharset.
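The output grammar above is small enough to check with one regular expression. This is an illustrative Python sketch only; `parse_output_spec` and its result keys are hypothetical names, and the `supported` flag simply encodes the note that only O1s and O1c are currently supported.

```python
import re

OUTPUT_RE = re.compile(r"O([012])([lsc])(\d+)")
SHAPES = {"2": "heatmap", "1": "sequence", "0": "category"}
NONLINEARITIES = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def parse_output_spec(word):
    """Split an output word such as 'O1c105' into its three parts."""
    m = OUTPUT_RE.fullmatch(word)
    if m is None:
        raise ValueError(f"not an output spec: {word!r}")
    dims, kind, n = m.groups()
    if kind == "c" and dims != "1":
        # CTC only makes sense for 1-d (sequence) output.
        raise ValueError("c (CTC) can only be used with 1 (sequence)")
    return {
        "shape": SHAPES[dims],
        "nonlinearity": NONLINEARITIES[kind],
        # Ignored in practice: the real count comes from the unicharset.
        "num_classes": int(n),
        "supported": (dims, kind) in (("1", "s"), ("1", "c")),
    }


if __name__ == "__main__":
    print(parse_output_spec("O1c105"))
```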

Syntax of the layers in between

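NOTE that all ops input and output the standard TF convention of a 4-d tensor, [batch, height, width, depth], regardless of any collapsing of dimensions. This greatly simplifies things, and allows the VGSLSpecs class to track changes to the values of widths and heights, so that they can be passed correctly to LSTM operations and used by any downstream CTC operation.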

NOTE: In the descriptions below, <d> is a numeric value, and literals are described using regular expression syntax.

NOTE: Whitespace is allowed between ops.

Functional ops

C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage,
  random infill, d outputs, with s|t|r|l|m non-linear layer.
F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs.
  Connects to every y,x,depth position of the input, reducing height and
  width to 1, and producing a single <d> vector as the output.
  Input height and width *must* be constant.
  For a sliding-window linear or non-linear map that connects just to the
  input depth, and leaves the input image size as-is, use a 1x1 convolution
  eg. Cr1,1,64 instead of Fr64.
L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs.
  The LSTM must have one of:
    f runs the LSTM forward only.
    r runs the LSTM reversed only.
    b runs the LSTM bidirectionally.
  It will operate on either the x- or y-dimension, treating the other dimension
  independently (as if part of the batch).
  s (optional) summarizes the output in the requested dimension, outputting
    only the final step, collapsing the dimension to a single element.
LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax.
LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax,
  with binary Encoding.

In the above, (s|t|r|l|m) specifies the type of the non-linearity:

s = sigmoid
t = tanh
r = relu
l = linear (i.e., No non-linearity)
m = softmax

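Examples:

Cr5,5,32 runs a 5x5 Relu convolution with a depth (number of filters) of 32.

Lfx128 runs a forward-only LSTM in the x-dimension with 128 outputs, treating the y-dimension independently.

Lfys64 runs a forward-only LSTM in the y-dimension with 64 outputs, treating the x-dimension independently and collapsing the y-dimension to 1 element.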
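Since the ops are described with regular-expression syntax, the three examples can be decoded directly with Python's `re` module. The sketch below is illustrative only: `describe_op` and its result keys are hypothetical, and it covers just the C, F and L(f|r|b)(x|y)[s] forms (LS and LE are deliberately left out).

```python
import re

NONLINEARITIES = {"s": "sigmoid", "t": "tanh", "r": "relu",
                  "l": "linear", "m": "softmax"}
DIRECTIONS = {"f": "forward", "r": "reverse", "b": "bidirectional"}


def describe_op(word):
    """Decode one functional op word into a small dictionary."""
    m = re.fullmatch(r"C([strlm])(\d+),(\d+),(\d+)", word)
    if m:
        return {"op": "conv", "nonlinearity": NONLINEARITIES[m.group(1)],
                "window_y": int(m.group(2)), "window_x": int(m.group(3)),
                "outputs": int(m.group(4))}
    m = re.fullmatch(r"F([strlm])(\d+)", word)
    if m:
        return {"op": "fully_connected",
                "nonlinearity": NONLINEARITIES[m.group(1)],
                "outputs": int(m.group(2))}
    m = re.fullmatch(r"L([frb])([xy])(s?)(\d+)", word)
    if m:
        return {"op": "lstm", "direction": DIRECTIONS[m.group(1)],
                "dimension": m.group(2),       # dimension the LSTM runs along
                "summarize": m.group(3) == "s",  # collapse that dimension to 1
                "outputs": int(m.group(4))}
    raise ValueError(f"unrecognized op: {word!r}")


if __name__ == "__main__":
    for w in ("Cr5,5,32", "Lfx128", "Lfys64"):
        print(w, describe_op(w))
```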

Plumbing ops

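The plumbing ops allow the construction of arbitrarily complex graphs. Something currently missing is the ability to define macros for generating, say, an inception unit in multiple places.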

[...] Execute ... networks in series (layers).
(...) Execute ... networks in parallel, with their output concatenated in depth.
S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by
  increasing the depth of the input by factor xy.
  **NOTE** that the TF implementation of VGSLSpecs has a different S that is
  not yet implemented in Tesseract.
Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
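The effect of the plumbing ops on the [batch, height, width, depth] shape can be sketched in a few lines of Python. The helper names are hypothetical, and the use of floor division for sizes that are not an exact multiple of the factor is an assumption made here; the actual rounding in Tesseract may differ. A dimension of 0 (variable) stays 0.

```python
def maxpool_shape(shape, y, x):
    """Mp<y>,<x>: each (y, x) rectangle becomes one value; depth is unchanged."""
    b, h, w, d = shape
    # Assumption: floor division when the size is not a multiple of the factor.
    return (b, h // y, w // x, d)


def rescale_shape(shape, y, x):
    """S<y>,<x>: shrink by (y, x) and move the data into depth (factor x*y)."""
    b, h, w, d = shape
    return (b, h // y, w // x, d * x * y)


def parallel_shape(shapes):
    """(...): parallel networks have their outputs concatenated in depth."""
    b, h, w, _ = shapes[0]
    if any(s[:3] != (b, h, w) for s in shapes):
        raise ValueError("parallel branches must agree on batch, height, width")
    return (b, h, w, sum(s[3] for s in shapes))


if __name__ == "__main__":
    print(maxpool_shape((1, 48, 0, 16), 3, 3))
    print(rescale_shape((1, 48, 0, 1), 2, 2))
    print(parallel_shape([(1, 1, 0, 128), (1, 1, 0, 128)]))
```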

Full example: A 1-D LSTM capable of high-quality OCR

[1,1,0,48 Lbx256 O1c105]

As layer descriptions: (The input layer is at the bottom, the output at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lbx256: Bi-directional LSTM in x with 256 outputs
1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated
  as a 1-dimensional sequence of vertical pixel strips.
[]: The network is always expressed as a series of layers.

This network works well for OCR, as long as the input image is carefully normalized in the vertical direction, with the baseline and meanline in constant places.

Full example: A multi-layer LSTM capable of high-quality OCR

[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

As layer descriptions: (The input layer is at the bottom, the output at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lfx256: Forward-only LSTM in x with 256 outputs
Lrx128: Reverse-only LSTM in x with 128 outputs
Lfx128: Forward-only LSTM in x with 128 outputs
Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs
Mp3,3: 3 x 3 Maxpool
Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity
1,0,0,1: Input is a batch of 1 image of variable size in greyscale
[]: The network is always expressed as a series of layers.

The summarizing LSTM makes this network more resilient to variation in the vertical position of the text.
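To see how the 4-d shape evolves through the multi-layer example, the layers can be traced one word at a time. This Python sketch is an illustration rather than Tesseract's implementation: `trace_shapes` is a hypothetical helper, it handles only the op forms that appear in this example, it assumes floor division for Mp, and it takes the class count in the output word at face value even though the real count comes from the unicharset.

```python
import re


def trace_shapes(spec):
    """Return [(word, (batch, height, width, depth)), ...] for a simple spec."""
    words = spec.strip().strip("[]").split()
    shape = tuple(int(v) for v in words[0].split(","))
    trace = [(words[0], shape)]
    for word in words[1:]:
        b, h, w, d = shape
        if m := re.fullmatch(r"C[strlm](\d+),(\d+),(\d+)", word):
            # Convolution: no shrinkage, depth becomes the output count.
            shape = (b, h, w, int(m.group(3)))
        elif m := re.fullmatch(r"Mp(\d+),(\d+)", word):
            # Maxpool: 0 (variable) stays 0 under floor division.
            shape = (b, h // int(m.group(1)), w // int(m.group(2)), d)
        elif m := re.fullmatch(r"L[fr]([xy])(s?)(\d+)", word):
            dim, summarize, n = m.groups()
            if summarize:
                # A summarizing LSTM collapses its own dimension to 1.
                h, w = (1, w) if dim == "y" else (h, 1)
            shape = (b, h, w, int(n))
        elif m := re.fullmatch(r"O[012][lsc](\d+)", word):
            shape = (b, h, w, int(m.group(1)))
        else:
            raise ValueError(f"op not covered by this sketch: {word!r}")
        trace.append((word, shape))
    return trace


if __name__ == "__main__":
    spec = "[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"
    for word, shape in trace_shapes(spec):
        print(f"{word:10s} -> {shape}")
```

In this trace the height stays unknown (0) until Lfys64, which fixes it at 1; the width remains variable all the way to the output.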

Variable-size inputs and summarizing LSTM

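NOTE that currently the only way of collapsing a dimension of unknown size to a known size (1) is through the use of a summarizing LSTM. A single summarizing LSTM collapses one dimension (x or y), leaving a 1-d sequence. The 1-d sequence can then be collapsed in the other dimension to make a 0-d categorical (softmax) or embedding (logistic) output.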

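For OCR purposes, then, the height of the input images must either be fixed and scaled vertically to 1 (using Mp or S) by the top layer, or, to allow variable-height images, a summarizing LSTM must be used to collapse the vertical dimension to a single value. The summarizing LSTM can also be used with a fixed-height input.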
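The height rule above can be turned into a quick sanity check on a spec string. The following Python sketch is illustrative: `final_height` is a hypothetical helper that follows only Mp, S and y-summarizing LSTMs, assumes floor division, and ignores F layers and nested (...) groups.

```python
import re


def final_height(spec):
    """Height after all layers; 0 means it is still unknown (variable)."""
    words = spec.strip().strip("[]").split()
    height = int(words[0].split(",")[1])
    for word in words[1:]:
        m = re.fullmatch(r"(?:Mp|S)(\d+),(\d+)", word)
        if m:
            # Assumption: floor division; a variable height (0) stays 0.
            height //= int(m.group(1))
        elif re.fullmatch(r"L[frb]ys\d+", word):
            # A y-summarizing LSTM collapses the height to 1.
            height = 1
    return height


def height_ok_for_ocr(spec):
    """True when the vertical dimension has been reduced to a single value."""
    return final_height(spec) == 1


if __name__ == "__main__":
    print(height_ok_for_ocr("[1,1,0,48 Lbx256 O1c105]"))
    print(height_ok_for_ocr("[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"))
    print(height_ok_for_ocr("[1,0,0,1 Ct5,5,16 Mp3,3 Lfx128 O1c105]"))
```

The third spec fails the check because its height is variable and nothing in the network summarizes the y-dimension.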