Towards Bridging Machine Learning and Logical Reasoning


**Machine Learning and Reasoning**

**... now is VERY good at**

Mapping **sensory information** to a **concept**.

**... now is NOT very good at**

**... now is way worse than human at**

```
Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
Answer: 4
Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544
Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30
Question: Let e(l) = l - 6. Is 2 a factor of both e(9) and 2?
Answer: False
Question: Let u(n) = -n**3 - n**2. Let e(c) = -2*c**3 + c. Let l(j) = -118*e(j) + 54*u(j). What is the derivative of l(a)?
Answer: 546*a**2 - 108*a - 118
Question: Three letters picked without replacement from qqqkkklkqkkk. Give prob of sequence qql.
Answer: 1/110
```

Machine learning typically minimises the empirical risk over a hypothesis space \(\mathcal{H}\):

\begin{equation}
R_{emp}(h)=\frac{1}{n}\sum_{i=1}^n L(h(\mathbf{x}_i),y_i)
\end{equation}

\begin{equation}
\hat{h}=\arg\min_{h\in\mathcal{H}}R_{emp}(h)
\end{equation}

- Hard to tell machines what we know;
- Hard to understand what machines have learned.

**... is the very first task in AI**

- General Problem Solving
- Automated Theorem Proving
- Boolean Satisfiability
- Expert Systems
- Logic Programming
- Inductive Logic Programming
- Probabilistic Logic Programming
- Constraint Logic Programming, Answer Set Programming
- …

A physical symbol system has the necessary and sufficient means for general intelligent action.

— Allen Newell and Herbert A. Simon, 1975.

Real objects seldom wear unique identifiers or preannounce their existence like the cast of a play.

— Stuart Russell, 2015.

- Multi-valued / fuzzy logic;
- How to define “\(\rightarrow\)” (implication)? See the sketch below.
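
As a concrete illustration of this design choice, here is a short sketch of three standard fuzzy implication operators (textbook definitions, not something specific to this talk):

```
# Three standard fuzzy implications over truth values in [0, 1];
# which operator you pick changes what "a -> b" means in fuzzy logic.

def goedel(a, b):        # Goedel implication: 1 if a <= b, else b
    return 1.0 if a <= b else b

def lukasiewicz(a, b):   # Lukasiewicz implication: min(1, 1 - a + b)
    return min(1.0, 1.0 - a + b)

def goguen(a, b):        # Goguen (product) implication: 1 if a <= b, else b/a
    return 1.0 if a == 0 else min(1.0, b / a)

for imp in (goedel, lukasiewicz, goguen):
    print(imp.__name__, imp(0.8, 0.4))  # -> 0.4, ~0.6, 0.5 respectively
```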

- Statistical Relational Learning & Probabilistic Logic Programming;
  - Probabilistic Graphical Models;
  - Independence assumptions, pseudo-likelihood.
- Neural-Symbolic Learning;
  - Embedding representations;
  - Fuzzy logic operators.

**Abductive Reasoning**

\begin{eqnarray}
\color{#CC9393}{\mathbf{highlight}}(Dir, Obj) &\leftarrow&\\
&&\hspace{-6em} \color{#8CD0D3}{\mathbf{convex}}(Obj)\wedge \color{#8CD0D3}{\mathbf{light}}(Dir).\\
\color{#CC9393}{\mathbf{highlight}}(Dir_1, Obj)&\leftarrow&\\
&&\hspace{-6em} \color{#8CD0D3}{\mathbf{concave}}(Obj)\wedge \color{#8CD0D3}{\mathbf{light}}(Dir_2)\\
&&\hspace{-6em} \wedge\, opposite(Dir_1, Dir_2).
\end{eqnarray}
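
To make the abduction concrete, here is a minimal Python sketch (illustrative only, not the talk's implementation): given the two rules above as background knowledge and an observed highlight, we abduce which surface property explains it. The predicate names mirror the rules; the direction vocabulary is hypothetical.

```
# Minimal abduction sketch over the two rules above:
#   highlight(Dir, Obj)  <- convex(Obj)  & light(Dir)
#   highlight(D1,  Obj)  <- concave(Obj) & light(D2) & opposite(D1, D2)

def opposite(d1, d2):
    """Hypothetical direction vocabulary for the opposite/2 predicate."""
    pairs = {("left", "right"), ("right", "left"), ("up", "down"), ("down", "up")}
    return (d1, d2) in pairs

def explains(hypothesis, light_dir, highlight_dir):
    """Does the abduced hypothesis, plus the background rules,
    entail the observed highlight direction?"""
    if hypothesis == "convex":
        return highlight_dir == light_dir          # rule 1
    if hypothesis == "concave":
        return opposite(highlight_dir, light_dir)  # rule 2
    return False

def abduce(light_dir, highlight_dir):
    """Return every candidate hypothesis that explains the observation."""
    return [h for h in ("convex", "concave")
            if explains(h, light_dir, highlight_dir)]

# Light comes from the left, but the highlight appears on the right:
print(abduce("left", "right"))  # -> ['concave']
```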

**Temple of the Foliated Cross**

It records several major events and their dates.

- Rows 1-2: \(X\)
- Rows 3-7: \(Y\)
- Rows 8-9: \(Z\)

**Calculation**
\[ X\oplus Y=Z \]

- Col. I, III, V: **values** \(a_i\);
- Col. II, IV, VI: **units** \(u_i\).

**Numbers**

- \(X=X_0\)
- \(Y=\sum_{i=3}^7 a_i\cdot u_i\)
- \(Z=\sum_{i=8}^9 a_i\cdot u_i\)

**Perception**:
- \(\text{Glyphs}\) (image) \(\mapsto\) \(\text{Numbers}\) (symbol).

**Abductive Reasoning**:
- Observation: the equations on the tablet are **correct**;
- Background knowledge:
  - Structure: \(X\oplus Y=Z\);
  - Calculation rules: 20-based \(\oplus\).

**Trial-and-error**:
- Repeat until perception and reasoning are **consistent**.
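
As a concrete illustration, here is a minimal sketch of the consistency check, assuming the perceived numbers are read as value-unit pairs under the 20-based system; all names and example values are hypothetical:

```
# Hedged sketch: check whether perceived numbers satisfy X (+) Y = Z
# under a 20-based (vigesimal) positional system. Names are illustrative.

def to_int(pairs):
    """Convert [(value, unit), ...] into an integer, i.e. sum a_i * u_i
    as in the slides, with units u_i being powers of 20."""
    return sum(a * u for a, u in pairs)

def consistent(x_pairs, y_pairs, z_pairs):
    """Abductive check: is the equation on the tablet correct
    under the assumed structure X (+) Y = Z?"""
    return to_int(x_pairs) + to_int(y_pairs) == to_int(z_pairs)

# Hypothetical perceived glyphs: X = 9*20 + 13, Y = 2*20 + 7, Z = 12*20 + 0
X = [(9, 20), (13, 1)]
Y = [(2, 20), (7, 1)]
Z = [(12, 20), (0, 1)]
print(consistent(X, Y, Z))  # -> True; a mismatch triggers another trial
```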

**Input**:
- Examples: \(D=\{\langle \mathbf{x}_1,y_1\rangle,\ldots,\langle \mathbf{x}_m,y_m\rangle\}\);
- Background knowledge: \(KB\);
- Primitive symbols (**pseudo-labels**): \(\mathcal{P}=\{p_1,p_2,\ldots\}\).

**Output**: Hypothesis \(H=p\cup\Delta_C\), consisting of:
- **Perception (machine learning) model** \(p:\mathcal{X}\mapsto \mathcal{P}\);
- **Knowledge (reasoning) model** \(\Delta_C\), where \[ KB\cup\Delta_C\cup p(\mathbf{x}_i)\models y_i. \]

**Handwritten Equation Decipherment**:
- Untrained perception model (CNN);
- Unknown operation rules: add / logical xor / etc.;
- Learn **perception** and **reasoning** jointly.

**Challenge**:
- Labels for training the perception model need to be inferred (abduced) by logical reasoning;
- Logical reasoning requires perceived symbols as input.

**Input** (with labels of equation correctness): images of handwritten equations.

**Output**:
- Well-trained CNN \(p:\mathbb{R}^d\mapsto\{0,1,+,=\}\);
- Operation rules, e.g. `1+1=10`, `1+0=1`, … (add); `1+1=0`, `0+1=1`, … (xor).

**Equation structure (DCG grammars)**:

- All equations are of the form `X+Y=Z`;
- Digits are lists of `0` and `1`.

**Binary operation**:

- Calculated bit-by-bit, from the last bit to the first;
- Carries are allowed (see the sketch below).
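
Below is a minimal executable sketch of this background knowledge (the talk uses Prolog DCG rules; this Python stand-in is illustrative). The rule tables play the role of the learned `my_op` facts; bits here are written most-significant first, which may differ from the list convention in the raw logs below.

```
# Sketch: combine two bit lists bit-by-bit, from the last bit to the
# first, allowing carries. Rule tables mirror the my_op facts:
# add: 1+1=10, 1+0=1, ...; xor: 1+1=0, 0+1=1, ...

ADD_RULES = {(0, 0): [0], (0, 1): [1], (1, 0): [1], (1, 1): [1, 0]}
XOR_RULES = {(0, 0): [0], (0, 1): [1], (1, 0): [1], (1, 1): [0]}

def apply_op(x_bits, y_bits, rules):
    """Evaluate X op Y over bit lists (most-significant bit first)."""
    x, y = x_bits[::-1], y_bits[::-1]   # walk from the last bit to the first
    result, carry = [], []
    for i in range(max(len(x), len(y))):
        bits = [x[i] if i < len(x) else 0, y[i] if i < len(y) else 0] + carry
        carry = []
        total = bits[0]
        for b in bits[1:]:              # fold in the carry bits pairwise
            out = rules[(total, b)]
            total = out[-1]             # last bit stays in this position
            carry += out[:-1]           # higher bits carry to the next one
        result.append(total)
    result.extend(carry)                # a leftover carry adds a new bit
    return result[::-1]

print(apply_op([1, 1], [1], ADD_RULES))  # 11 + 1 = 100 -> [1, 0, 0]
print(apply_op([1, 1], [1], XOR_RULES))  # 11 xor 01 = 10 -> [1, 0]
```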

**Machine learning**:
- Perceives **primitive logic facts** from **raw data**.

**Logical abduction**:
- Abduces **pseudo-labels (primitive logic facts)** to re-train \(p\);
- Learns logical rules \(\Delta_C\) that complete the reasoning from **primitive logic facts** to the **final concept**.

**Consistency optimisation**:
- Optimises the **consistency** of hypothesis and data.

- **Perception model**: Convolutional Neural Network;
- **Abductive reasoning model**: Abductive Logic Programming;
- **Consistency optimisation**: Derivative-Free Optimisation (RACOS).

**Intuition**:

- Maximise the number of instances in \(D\) that are **consistent** with \(H\):

\begin{align}
\max\limits_{H=p\cup\Delta_C}\quad \text{Con}(H\cup D),
\end{align}

where \(\text{Con}(H\cup D)\) is the size of the largest subset \(\hat{D}_C\subseteq D\) consistent with \(H\):

\begin{align}
\hat{D}_C=\arg\max\limits_{D_c\subseteq D}\quad&\mid D_c\mid\label{eq:al:con}\\
\mathrm{s.t.}\quad&\forall \langle \mathbf{x}_i,y_i\rangle\in D_c\quad(KB\cup \Delta_C \cup p(\mathbf{x}_i)\models y_i).\nonumber
\end{align}
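
Because each example is checked independently, the arg max over subsets reduces to counting the consistent examples. A minimal sketch, assuming a hypothetical `entails` callable standing in for the logical check \(KB\cup\Delta_C\cup p(\mathbf{x}_i)\models y_i\):

```
# Sketch of Con(H u D): count the examples whose perceived facts,
# together with KB and Delta_C, entail the observed label.
# `entails` is a hypothetical stand-in for a call to a logic prover.

def consistency(perceive, delta_C, KB, data, entails):
    """Size of the largest subset of `data` consistent with H = p u Delta_C;
    per-example checks are independent, so this is just a count."""
    return sum(1 for x, y in data
               if entails(KB, delta_C, perceive(x), y))
```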

**When perception model $p$ is fixed**:

- Recognise **pseudo-labels** \(p^t(\mathbf{x})=\cup_i p^t(\mathbf{x}_i)\) from raw data;
- Since \(p\) is untrained (no ground-truth labels), \(p^t(\mathbf{x})\) **might be wrong**;
- Mark the “possibly wrong” pseudo-labels \(\delta(p^t(X))\), where \(\delta\) is a function that **guesses which perceived symbols are wrong**;
- Maximise **consistency** by optimising \(\delta\):

\begin{align}
\max\limits_\delta\quad&\text{Con}(\delta(p^t(X))\cup\Delta_C \cup D)\label{eq:al:opt2}\\
\mathrm{s.t.}\quad&\mid\delta(p^t(X))\mid\leq M\nonumber
\end{align}

- Abduce the revised pseudo-labels \(r_\delta(X)\) and reasoning model \(\Delta_C\) based on \(\delta\).
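
A sketch of this step as plain random search over masks; the talk uses RACOS, so this stand-in only illustrates the interface. `score` is a hypothetical callable that abduces a revision under a given mask and returns its consistency.

```
import random

# Sketch: optimise delta, a boolean mask marking which pseudo-labels are
# "possibly wrong", by derivative-free random search (RACOS in the talk).

def optimise_delta(pseudo_labels, score, M, n_trials=100, seed=0):
    """pseudo_labels: perceived symbols p^t(X); score(mask): consistency
    of the abduced revision under `mask`; M: max marked positions
    (assumed M <= len(pseudo_labels))."""
    rng = random.Random(seed)
    n = len(pseudo_labels)
    best_mask, best_score = [False] * n, score([False] * n)
    for _ in range(n_trials):
        k = rng.randint(1, M)                 # mark at most M positions
        idx = set(rng.sample(range(n), k))
        mask = [i in idx for i in range(n)]
        s = score(mask)
        if s > best_score:
            best_mask, best_score = mask, s
    return best_mask, best_score
```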

**When reasoning model $\Delta_C$ is fixed**:

- Use the revised pseudo-labels \(r_\delta(X)\) to train the perception model \(p^{t+1}\):

\begin{align}
p^{t+1}=\arg\min\limits_{p}\quad&\sum_{i=1}^mL(p(\mathbf{x}_i),r_\delta(\mathbf{x}_i))
\end{align}
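
Putting the two alternating steps together, a high-level sketch of the training loop: `abduce` and `score_consistency` are hypothetical stand-ins for the ALP and consistency-checking modules, and `optimise_delta` is the random-search sketch above.

```
# High-level sketch of the abductive learning loop; all component
# callables are hypothetical stand-ins, wired together as parameters.

def abductive_learning(model, data, KB, abduce, score_consistency, M,
                       n_rounds=50):
    """model: perception model with predict(x) and fit(X, labels);
    abduce(KB, pseudo, mask, data) -> (revised_labels, delta_C)."""
    delta_C = None
    for _ in range(n_rounds):
        # Step 1 (p fixed): perceive pseudo-labels, then search for the
        # mask delta whose abduced revision is most consistent with KB.
        pseudo = [model.predict(x) for x, _ in data]

        def score(mask):
            revised, rules = abduce(KB, pseudo, mask, data)
            return score_consistency(revised, rules, KB, data)

        mask, _ = optimise_delta(pseudo, score, M)
        revised, delta_C = abduce(KB, pseudo, mask, data)
        # Step 2 (Delta_C fixed): retrain p on the abduced pseudo-labels.
        model.fit([x for x, _ in data], revised)
    return model, delta_C
```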

```
%%%%%%%%%%%%%% LENGTH: 7 to 8 %%%%%%%%%%%%%%
This is the CNN's current label:
[[1, 2, 0, 1, 0, 1, 2, 0], [1, 1, 0, 1, 0, 1, 3, 3], [1, 1, 0, 1, 0, 1, 0, 3], [2, 0, 2, 1, 0, 1, 2], [1, 1, 0, 0, 0, 1, 2], [1, 0, 1, 1, 0, 1, 3, 0], [1, 1, 0, 3, 0, 1, 1], [0, 0, 2, 1, 0, 1, 1], [1, 3, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 1, 3, 3]]
****Consistent instance:
consistent examples: [6, 8, 9]
mapping: {0: '+', 1: 0, 2: '=', 3: 1}
Current model's output:
00+1+00 01+0+00 0+00+011
Abduced labels:
00+1=00 01+0=00 0+00=011
Consistent percentage: 0.3
****Learned Rules:
rules: ['my_op([0],[0],[0,1])', 'my_op([1],[0],[0])', 'my_op([0],[1],[0])']
Train pool size is : 22
```

```
...
This is the CNN's current label:
[[1, 1, 0, 1, 2, 1, 3, 3], [1, 3, 0, 3, 2, 1, 3], [1, 0, 1, 1, 2, 1, 3, 3], [1, 1, 0, 1, 0, 1, 3, 3], [1, 0, 1, 1, 2, 1, 3, 3], [1, 1, 0, 1, 0, 1, 3, 3], [1, 0, 3, 3, 2, 1, 1], [1, 1, 0, 1, 2, 1, 3, 3], [1, 1, 0, 1, 2, 1, 3, 3], [3, 0, 1, 1, 2, 1, 1]]
****Consistent instance:
consistent examples: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
mapping: {0: '+', 1: 0, 2: '=', 3: 1}
Current model's output:
00+0=011 01+1=01 0+00=011 00+0=011 0+00=011 00+0=011 0+01=00 00+0=011 00+0=011 1+00=00
Abduced labels:
00+0=011 01+1=01 0+00=011 00+0=011 0+00=011 00+0=011 0+01=00 00+0=011 00+0=011 1+00=00
Consistent percentage: 1.0
****Learned feature:
rules: ['my_op([1],[0],[0])', 'my_op([0],[1],[0])', 'my_op([1],[1],[1])', 'my_op([0],[0],[0,1])']
Train pool size is : 77
```

**Data**: equations of length 5-26, 300 instances per length;
- **DBA**: MNIST equations;
- **RBA**: Omniglot equations;
- Tasks: **binary addition** and **exclusive-or**.

**Compared methods**:
- **ABL-all**: our approach with all training data;
- **ABL-short**: our approach with only **length 7-10** equations;
- **DNC**: memory-based DNN;
- **Transformer**: attention-based DNN;
- **BiLSTM**: seq2seq baseline.

**Test Acc. vs Eq. length**

**Training Acc.**

**Reusing $p$ (L) vs reusing $\Delta_C$ (R)**

- No embeddings/gradients; utilises **full-featured first-order logic**;
- Better generalisation;
- Handles **recursive knowledge**;
- Directly leverages over 60 years of **symbolic AI** research.

- Abductive reasoning connects **high-level reasoning** and **low-level perception**;
- Abduction is neither sound nor complete; humans/machines need **trial-and-error**;
- The dividing line between **high-level** and **low-level** is **unclear**; how to combine symbolic and sub-symbolic AI more efficiently is still an open question.