Abductive Learning

Towards Bridging Machine Learning and Logical Reasoning



Wang-Zhou Dai

Department of Computing, Imperial College London

Wednesday Aug 28th, 2019

Machine Perception
and Reasoning

Machine Learning

... now is VERY good at

Mapping sensory information to a concept.

Machine Learning

... now is NOT very good at

Machine Learning

... now is far worse than humans at

Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
Answer: 4

Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544

Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30

Question: Let e(l) = l - 6. Is 2 a factor of both e(9) and 2?
Answer: False

Question: Let u(n) = -n**3 - n**2. Let e(c) = -2*c**3 + c. Let l(j) = -118*e(j) + 54*u(j). What is the derivative of l(a)?
Answer: 546*a**2 - 108*a - 118

Question: Three letters picked without replacement from qqqkkklkqkkk. Give prob of sequence qql.
Answer: 1/110

A Modern Rephrasing of Curve Fitting

Empirical Risk

\begin{equation} R_{emp}(h)=\frac{1}{n}\sum_{i=1}^n L(h(\mathbf{x}_i),y_i) \end{equation}

Empirical Risk Minimisation

\begin{equation} \hat{h}=\text{arg}\min_{h\in\mathcal{H}}R_{emp}(h) \end{equation}
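A minimal sketch of empirical risk minimisation, assuming a toy setting not prescribed above: degree-3 polynomial hypotheses, squared loss, and synthetic NumPy data.

import numpy as np

# Hypothesis space H: polynomials of a fixed degree (an illustrative choice).
DEGREE = 3

def empirical_risk(coeffs, xs, ys):
    """R_emp(h) = (1/n) * sum_i L(h(x_i), y_i) with squared loss."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

# Toy data: noisy samples of an underlying curve.
rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 50)
ys = np.sin(3 * xs) + 0.1 * rng.standard_normal(xs.shape)

# ERM: for squared loss over polynomial hypotheses the minimiser has a
# closed form (least squares), which polyfit computes.
h_hat = np.polyfit(xs, ys, DEGREE)
print("empirical risk of h_hat:", empirical_risk(h_hat, xs, ys))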

Treating Reasoning as Perception


Machine Reasoning

... was among the very first tasks in AI

Machine Reasoning

  • General Problem Solving
  • Automated Theorem Proving
  • Boolean Satisfiability
  • Expert Systems
  • Logic Programming
  • Inductive Logic Programming
  • Probabilistic Logic Programming
  • Constraint Logic Programming, Answer Set Programming

Symbolism

A physical symbol system has the necessary and sufficient means for general intelligent action.

Allen Newell and Herbert A. Simon, 1975.

40 years later ...

Real objects seldom wear unique identifiers or preannounce their existence like the cast of a play.

Stuart Russell, 2015.

The Separation of Perception and Reasoning

Bridging Machine Learning and Reasoning

Too Many Attempts

  • Multi-valued / Fuzzy Logic;
    • How to define “\(\rightarrow\)” (implication);
  • Statistical Relational Learning &
    Probabilistic Logic Programming;
    • Probabilistic Graphical Model;
    • Independence assumptions, pseudo-likelihood
  • Neural Symbolic Learning;
    • Embedding representation;
    • Fuzzy logic operators;
  • Abductive Reasoning

Deduction vs. Abduction

Explain (specific) observations based on (general) background knowledge. \[ B\cup \color{#8CD0D3}{\mathbf{H}}\models \color{#CC9393}{\mathbf{E}} \]


\begin{eqnarray} \color{#CC9393}{\mathbf{highlight}}(Dir, Obj) &\leftarrow&\\ &&\hspace{-6em} \color{#8CD0D3}{\mathbf{convex}}(Obj)\wedge \color{#8CD0D3}{\mathbf{light}}(Dir).\\ \color{#CC9393}{\mathbf{highlight}}(Dir_1, Obj)&\leftarrow&\\ &&\hspace{-6em} \color{#8CD0D3}{\mathbf{concave}}(Obj)\wedge \color{#8CD0D3}{\mathbf{light}}(Dir_2)\\ &&\hspace{-6em} \wedge\; opposite(Dir_1, Dir_2). \end{eqnarray}
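A toy enumeration of this abduction pattern, assuming a small direction vocabulary and an opposite relation invented here for illustration: given an observed highlight, list every hypothesis H with B ∪ H ⊨ E.

# Background knowledge B, encoded directly in Python for illustration.
OPPOSITE = {"left": "right", "right": "left"}

def entails(hypothesis, observation):
    """Does B together with the hypothesis entail highlight(Dir, Obj)?"""
    shape, light_dir = hypothesis
    obs_dir, _obj = observation
    if shape == "convex":
        return light_dir == obs_dir                # first rule
    if shape == "concave":
        return OPPOSITE.get(light_dir) == obs_dir  # second rule
    return False

def abduce(observation):
    """Enumerate abducible explanations consistent with the observation."""
    candidates = [(shape, d) for shape in ("convex", "concave")
                  for d in ("left", "right")]
    return [h for h in candidates if entails(h, observation)]

print(abduce(("left", "crater")))
# [('convex', 'left'), ('concave', 'right')] -- two competing explanations.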

A Human Example

The Mayan Calendars

Temple of the Foliated Cross

Tablet in the temple

It records several major events and their dates.

The calendar

Structure

  • Row 1-2: \(X\)
  • Row 3-7: \(Y\)
  • Row 8-9: \(Z\)

Calculation \[ X\oplus Y=Z \]

The calendar

Glyphs

  • Col. I, III, V: Values \(a_i\)
  • Col. II, IV, VI: Units \(u_i\)

Numbers

  • \(X=X_0\)
  • \(Y=\sum_{i=3}^7 a_i\cdot u_i\)
  • \(Z=\sum_{i=8}^9 a_i\cdot u_i\)
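As a toy check of these weighted sums (the glyph readings and unit values below are invented placeholders, not readings from the tablet):

# Invented (a_i, u_i) pairs standing in for value/unit glyph readings.
Y_glyphs = [(9, 7200), (12, 360), (11, 20), (4, 1)]   # rows 3-7
Z_glyphs = [(2, 7200), (5, 360)]                       # rows 8-9

Y = sum(a * u for a, u in Y_glyphs)   # Y = sum of a_i * u_i
Z = sum(a * u for a, u in Z_glyphs)   # Z = sum of a_i * u_i
print(Y, Z)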

The “head variants”

Cracking the glyphs


  1. Perception:
    • \(\text{Glyphs}\) (image) \(\mapsto\) \(\text{Numbers}\) (symbol);
  2. Abductive Reasoning:
    • Observation: the equations on the tablet are correct;
    • Background Knowledge:
      • Structure: \(X\oplus Y=Z\)
      • Calculation rules: 20-based \(\oplus\);
  3. Trial and error:
    • Until perception and reasoning are consistent.

Abductive Learning

The Framework

  • Input:
    • Examples: \(D=\{\langle \mathbf{x}_1,y_1\rangle,\ldots,\langle \mathbf{x}_m,y_m\rangle\}\);
    • Background knowledge: \(KB\);
      • Primitive symbols (pseudo labels): \(\mathcal{P}=\{p_1,p_2,\ldots\}\);
  • Output: Hypothesis \(H=p\cup\Delta_C\);
    • Perception (machine learning) model \(p:\mathcal{X}\mapsto \mathcal{P}\);
    • Knowledge (reasoning) model \(\Delta_C\), where:
      \[ KB\cup\Delta_C\cup p(\mathbf{x}_i)\models y_i. \]
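The output condition can be read as a simple predicate; a minimal sketch, where kb_entails is an assumed oracle wrapping the logic engine (it checks KB ∪ Δ_C ∪ facts ⊨ y):

def entailed(kb_entails, delta_c, p, x, y):
    """Does KB ∪ Δ_C ∪ p(x) ⊨ y hold for a single example?"""
    return kb_entails(delta_c, p(x), y)

def is_valid_hypothesis(kb_entails, delta_c, p, data):
    """H = p ∪ Δ_C is a valid output if every training example is entailed."""
    return all(entailed(kb_entails, delta_c, p, x, y) for x, y in data)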

A Simplified Task

  • Handwritten Equation Decipherment:
    • Untrained perception model (CNN);
    • Unknown operation rules: add / logical xor / etc.
    • Learn perception and reasoning jointly;
  • Challenges:
    • Labels for training the perception model must be inferred (abduced) by logical reasoning;
    • Logical reasoning requires the perceived symbols as input;

Handwritten Equation Decipherment

  • Input (labelled only with the equation's correctness):

  • Output:
    • Well-trained CNN \(p:\mathbb{R}^d\mapsto\{0,1,+,=\}\)
    • Operation rules:
      • e.g. 1+1=10, 1+0=1,…(add); 1+1=0, 0+1=1,…(xor).

Background knowledge 1

Equation structure (DCG grammars):

  • All equations are X+Y=Z;
  • Digits are lists of 0s and 1s.
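A rough Python stand-in for this grammar (the actual background knowledge is a Prolog DCG; the regular expression below is only an assumed approximation):

import re

# Equations are X+Y=Z where X, Y and Z are non-empty strings of binary digits.
EQUATION = re.compile(r"^[01]+\+[01]+=[01]+$")

def well_formed(tokens):
    """tokens: a pseudo-label sequence such as ['1','0','+','1','=','1','1']."""
    return bool(EQUATION.match("".join(tokens)))

print(well_formed(list("10+1=11")))   # True
print(well_formed(list("1++0=1")))    # False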

Background knowledge 2

Binary operation:

  • Calculated bit by bit, from the last bit to the first;
    • Carries are allowed (sketched below).
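A sketch of what "bit by bit with carries" means, assuming the abduced rules are kept in a Python dict (in the system they are Prolog facts such as my_op([1],[1],[0,1])):

# Hypothetical rule table for binary addition: (x_bit, y_bit) -> result bits,
# least significant bit first; a two-bit result encodes a carry.
ADD_RULES = {(0, 0): [0], (0, 1): [1], (1, 0): [1], (1, 1): [0, 1]}

def apply_op(x_bits, y_bits, rules):
    """Combine two digit lists bit by bit, from the last bit to the first,
    propagating carries; digits are written most-significant-first."""
    x, y = x_bits[::-1], y_bits[::-1]       # start from the least significant bit
    result, carry = [], 0
    for i in range(max(len(x), len(y))):
        xb = x[i] if i < len(x) else 0
        yb = y[i] if i < len(y) else 0
        bits = list(rules[(xb, yb)])        # e.g. 1 + 1 -> [0, 1]
        if carry:                           # fold the pending carry into the low bit
            bits = list(rules[(bits[0], 1)]) + bits[1:]
        carry = bits[1] if len(bits) > 1 else 0
        result.append(bits[0])
    if carry:
        result.append(carry)
    return result[::-1]

print(apply_op([1, 1], [1], ADD_RULES))     # [1, 0, 0], i.e. 11 + 1 = 100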

Model Structure

  1. Machine learning:
    • Maps raw data \(\mapsto\) primitive logic facts;
  2. Logical abduction:
    • Abduces pseudo-labels (primitive logic facts) to re-train \(p\);
    • Learns logical rules \(\Delta_C\) to complete the reasoning from primitive logic facts \(\mapsto\) the final concept;
  3. Optimise the consistency between the hypothesis and the data.

Implementation

  1. Perception model: Convolutional Neural Network
  2. Abductive reasoning model: Abductive Logic Programming
  3. Consistency optimisation: Derivative-Free Optimisation (RACOS)
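A high-level sketch of how these three components interact in one training round. Every interface here (cnn.predict, kb.abduce, cnn.retrain) is a hypothetical stand-in rather than the actual ABL-HED API, and plain random search stands in for RACOS.

import random

def abductive_learning_round(cnn, kb, data, trials=200, max_revised=3):
    """One illustrative round: perceive, search for δ, abduce, re-train.

    Assumed interfaces: cnn.predict(x) -> pseudo-label sequence;
    kb.abduce(pseudo, labels, marks) -> (revised pseudo-labels, Δ_C,
    number of consistent examples) or None; cnn.retrain(examples).
    """
    pseudo = [cnn.predict(x) for x, _ in data]
    labels = [y for _, y in data]

    best = (-1, None, None)                   # (consistency, revised labels, Δ_C)
    for _ in range(trials):                   # derivative-free search over δ
        # δ: for each equation, mark up to max_revised symbols as possibly wrong
        marks = [random.sample(range(len(p)), k=min(max_revised, len(p)))
                 for p in pseudo]
        abduced = kb.abduce(pseudo, labels, marks)
        if abduced is None:                   # no consistent abduction under this δ
            continue
        revised, delta_c, n_consistent = abduced
        if n_consistent > best[0]:
            best = (n_consistent, revised, delta_c)

    n_consistent, revised, delta_c = best
    if revised is None:                       # no δ yielded a consistent abduction
        return cnn, None, 0.0
    cnn.retrain([(x, r) for (x, _), r in zip(data, revised)])  # fit p to r_δ(X)
    return cnn, delta_c, n_consistent / len(data)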

Formulation

Intuition:

  • Maximise the number of instances in \(D\) that are consistent with \(H\):

\begin{align} \max\limits_{H=p\cup\Delta_C}\quad \text{Con}(H\cup D), \end{align}

where \(\text{Con}(H\cup D)\) is the size of the largest subset \(\hat{D}_C\subseteq D\) consistent with \(H\):

\begin{align} \hat{D}_C=\arg\max\limits_{D_c\subseteq D}\quad&\mid D_c\mid\label{eq:al:con}\\ \mathrm{s.t.}\quad&\forall \langle \mathbf{x}_i,y_i\rangle\in D_c\quad(KB\cup \Delta_C \cup p(\mathbf{x}_i)\models y_i).\nonumber \end{align}
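In code, Con reduces to counting: because the constraint is imposed example by example, the largest consistent subset is simply the set of all individually consistent examples. A minimal sketch, with kb_entails again an assumed oracle over the logic side:

def con(kb_entails, delta_c, p, data):
    """Con(H ∪ D): number of examples with KB ∪ Δ_C ∪ p(x_i) ⊨ y_i."""
    return sum(1 for x, y in data if kb_entails(delta_c, p(x), y))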

Optimisation in sketch

When the perception model \(p\) is fixed:

  • Recognise pseudo-labels \(p^t(\mathbf{x})=\cup_i p^t(\mathbf{x}_i)\) from raw data;
  • Since \(p\) is untrained (no ground truth label), \(p^t(\mathbf{x})\) might be wrong;
  • Mark the “possibly wrong” pseudo-labels \(\delta(p^t(X))\), where \(\delta\) is a function that guesses which perceived symbols are wrong;
  • Maximise consistency by optimising \(\delta\);

\begin{align} \max\limits_\delta\quad&\text{Con}(\delta(p^t(X))\cup\Delta_C \cup D)\label{eq:al:opt2}\\ \mathrm{s.t.}\quad&\mid\delta(p^t(X))\mid\leq M\nonumber \end{align}

  • Abduce the revised pseudo-labels \(r_\delta(X)\) and reasoning model \(\Delta_C\) based on \(\delta\).

Optimisation in sketch

When the reasoning model \(\Delta_C\) is fixed:

  • Use the revised pseudo-labels \(r_\delta(X)\) to train the perception model \(p^{t+1}\):
\begin{align} p^{t+1}=\arg\min\limits_{p}\quad&\sum_{i=1}^mL(p(\mathbf{x}_i),r_\delta(\mathbf{x}_i)) \end{align}
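This step is ordinary supervised learning against the abduced labels; a minimal sketch using a generic scikit-learn classifier as a stand-in for the CNN (an assumption for brevity):

from sklearn.neural_network import MLPClassifier

def retrain_perception(symbol_images, revised_labels):
    """Fit p^{t+1} by minimising the loss against the revised pseudo-labels r_δ."""
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)
    clf.fit(symbol_images, revised_labels)   # symbol_images: flattened image array
    return clf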

Wrap-up

Training Log

%%%%%%%%%%%%%% LENGTH:  7  to  8 %%%%%%%%%%%%%%
This is the CNN's current label:
[[1, 2, 0, 1, 0, 1, 2, 0], [1, 1, 0, 1, 0, 1, 3, 3], [1, 1, 0, 1, 0, 1, 0, 3], [2, 0, 2, 1, 0, 1, 2], [1, 1, 0, 0, 0, 1, 2], [1, 0, 1, 1, 0, 1, 3, 0], [1, 1, 0, 3, 0, 1, 1], [0, 0, 2, 1, 0, 1, 1], [1, 3, 0, 1, 0, 1, 1], [1, 0, 1, 1, 0, 1, 3, 3]]
****Consistent instance:
consistent examples: [6, 8, 9]
mapping: {0: '+', 1: 0, 2: '=', 3: 1}
Current model's output:
00+1+00 01+0+00 0+00+011
Abduced labels:
00+1=00 01+0=00 0+00=011
Consistent percentage: 0.3
****Learned Rules:
rules:  ['my_op([0],[0],[0,1])', 'my_op([1],[0],[0])', 'my_op([0],[1],[0])']

Train pool size is : 22

Training Log

...
This is the CNN's current label:
[[1, 1, 0, 1, 2, 1, 3, 3], [1, 3, 0, 3, 2, 1, 3], [1, 0, 1, 1, 2, 1, 3, 3], [1, 1, 0, 1, 0, 1, 3, 3], [1, 0, 1, 1, 2, 1, 3, 3], [1, 1, 0, 1, 0, 1, 3, 3], [1, 0, 3, 3, 2, 1, 1], [1, 1, 0, 1, 2, 1, 3, 3], [1, 1, 0, 1, 2, 1, 3, 3], [3, 0, 1, 1, 2, 1, 1]]
****Consistent instance:
consistent examples: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
mapping: {0: '+', 1: 0, 2: '=', 3: 1}
Current model's output:
00+0=011 01+1=01 0+00=011 00+0=011 0+00=011 00+0=011 0+01=00 00+0=011 00+0=011 1+00=00
Abduced labels:
00+0=011 01+1=01 0+00=011 00+0=011 0+00=011 00+0=011 0+01=00 00+0=011 00+0=011 1+00=00
Consistent percentage: 1.0
****Learned feature:
rules:  ['my_op([1],[0],[0])', 'my_op([0],[1],[0])', 'my_op([1],[1],[1])', 'my_op([0],[0],[0,1])']

Train pool size is : 77

Experimental Results

Setting

  • Data: equations of length 5-26, with 300 instances per length
    • DBA: MNIST equations;
    • RBA: Omniglot equations;
    • Binary addition and exclusive-or.

  • Compared methods:
    • ABL-all: Our approach with all training data
    • ABL-short: Our approach with only length 7-10 equations;
    • DNC: Memory-based DNN;
    • Transformer: Attention-based DNN;
    • BiLSTM: Seq-2-seq baseline;

Prediction Accuracy

Test Acc. vs Eq. length

Mutually Beneficial Perception & Reasoning

Training Acc.

Model Reuse

Reusing \(p\) (left) vs. reusing \(\Delta_C\) (right)

Conclusion

Take-home message

  1. No embeddings or gradients; utilises full-featured first-order logic;
    • Better generalisation;
    • Handles recursive knowledge;
    • Takes advantage of over 60 years of symbolic AI research directly;
  2. Abductive reasoning connects high-level reasoning and low-level perception;
  3. Abduction is neither sound nor complete; both humans and machines need trial and error.
  4. The dividing line between high-level and low-level processing is unclear; how to combine symbolic and sub-symbolic AI more efficiently is still an open question.

Human-Like Computing

Each Plays to Their Strengths

Code: https://github.com/AbductiveLearning/ABL-HED