Wednesday, 15 April 2020

Identifiability and differentiable Likelihood function for Causal Discovery

One astonishing early milestone of causal discovery is the pair of algorithms PC and FCI. For a long while, conditional independence testing, driven by PC and FCI, was the mainstream approach to causal discovery. However, it is not as reliable as we thought and it has many limitations. The challenges of causal discovery are still with us, but luckily we now have more powerful and more varied tools to tackle these "old" problems.

LiNGAM

Observing that linear Structural Equation Models (SEMs) and Independent Component Analysis (ICA) share essentially the same problem formulation, LiNGAM [1] was proposed: it recovers linear causal relations under non-Gaussian noise.
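A toy sketch of this observation (my own illustration with scikit-learn's FastICA, not the official lingam package): generate data from a tiny linear SEM with uniform, hence non-Gaussian, noise and let ICA recover the unmixing structure.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    n = 5000

    # Ground truth: x1 -> x2 with coefficient 0.8, uniform (non-Gaussian) noise.
    e1 = rng.uniform(-1, 1, n)
    e2 = rng.uniform(-1, 1, n)
    x1 = e1
    x2 = 0.8 * x1 + e2
    X = np.column_stack([x1, x2])          # x = A e with A = (I - B)^{-1}

    ica = FastICA(n_components=2, random_state=0)
    ica.fit(X)
    W = ica.components_                    # unmixing matrix, rows ~ (I - B) up to permutation and scaling
    print(np.round(W, 2))

LiNGAM proper then permutes and rescales this unmixing matrix so that its diagonal is one, and reads the causal order off the resulting (permuted) strictly lower-triangular B.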

Identification Problem. The LiNGAM line of work raises a point I find very interesting: the identification problem. In machine learning people like the Gaussian distribution, yet LiNGAM deliberately assumes non-Gaussianity. The study of such identification problems can be traced back to PCA and latent variable models, and since then we have known that a latent variable model with Gaussian factors is rotationally invariant and therefore not identifiable.
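To remind myself why Gaussianity kills identifiability (the standard factor-analysis argument, nothing specific to LiNGAM):

    x = \Lambda z + \varepsilon, \qquad z \sim \mathcal{N}(0, I);
    \quad \text{for any orthogonal } R:\;
    \Lambda' = \Lambda R^{\top},\; z' = R z
    \;\Longrightarrow\; \Lambda' z' = \Lambda z, \quad z' \sim \mathcal{N}(0, I).

The rotated pair (\Lambda', z') induces exactly the same distribution over x as (\Lambda, z), so the loadings are only determined up to rotation; non-Gaussianity is what breaks this symmetry.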

Cost, Likelihood, and Optimisation. Another significant difference is that there is no conditional independence test and no faithfulness assumption (which is hardly testable). ICA is carried out by optimising some cost function, and here a very important concept comes in: optimisation. We hope to achieve causal discovery by "automatic" optimisation rather than by searching over an unknown combinatorial space, so we no longer need to hand-craft an algorithm's procedures just to keep its complexity manageable. Instead, we let the algorithm search by itself, guided by some principle, and that principle is almost always the gradient of a cost or likelihood function. Sounds familiar, right? It is the same idea used to train a neural network.

Score-based methods and No Tears
Score-based Causal Discovery. Using a DAG to represent a causal graph already existed in the PC and FCI era, but the wish to optimise over the whole graph was not yet satisfied. Machine learning loves optimisation and hates searching candidates one by one, and we have tons of tools for optimisation. GES partly satisfies this wish: it proposes a score function that evaluates the structure of a DAG, so we can optimise the score to get the best-fitting DAG. Here, score = f(DAG). Pretty cool, right? But naively we would still need to enumerate all possible DAGs to pick the best one, which is far too slow. Under some assumptions, GES becomes fast thanks to a smart forward-backward greedy search that does not need to visit every possible DAG. Later work proposes better, information-theory-flavoured score functions, but these are small refinements compared with the big idea underlying GES and score-based causal discovery.
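To make "score = f(DAG)" concrete, here is a minimal sketch of a decomposable Gaussian BIC score for a candidate DAG given as an adjacency matrix (my own toy version, not the exact score of any particular GES implementation):

    import numpy as np

    def bic_score(X, adj):
        # X: (n, d) data matrix; adj: (d, d) binary matrix with adj[i, j] = 1 meaning i -> j.
        # Returns a Gaussian BIC (lower is better). The score decomposes over nodes,
        # which is what lets GES update it locally when it adds or removes an edge.
        X = X - X.mean(axis=0)
        n, d = X.shape
        score = 0.0
        for j in range(d):
            parents = np.flatnonzero(adj[:, j])
            if parents.size:
                beta, *_ = np.linalg.lstsq(X[:, parents], X[:, j], rcond=None)
                resid = X[:, j] - X[:, parents] @ beta
            else:
                resid = X[:, j]
            rss = float(resid @ resid)
            k = parents.size + 1               # regression coefficients + noise variance
            score += n * np.log(rss / n) + k * np.log(n)
        return score

A search procedure then compares the scores of neighbouring graphs instead of enumerating every DAG.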


No Tears. Don't cry? Yes. Enumerating all possible DAGs is painful, but don't cry: a smarter tool is waiting for us. Why, even though the GES family already uses optimisation, do we still feel unhappy? Because it is combinatorial optimisation, and it is not differentiable! This is nothing like a neural network, which seriously limits what we can do with it; we simply lack good tools for combinatorial optimisation, so the GES family remains a sub-optimal compromise. Luckily, we can replace the combinatorial acyclicity constraint on the score function with a smooth, differentiable one, and then go "neural network"! This is what NOTEARS has done. Marvellous! Optimising a differentiable function: there can hardly be better news for machine learning.
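For reference, the trick is to measure "how cyclic" a weighted adjacency matrix W is with the smooth function h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when W corresponds to a DAG. A quick sketch to see it in action:

    import numpy as np
    from scipy.linalg import expm

    def notears_h(W):
        # NOTEARS acyclicity measure: tr(exp(W * W)) - d.
        # Equals 0 if and only if W has no directed cycles.
        return np.trace(expm(W * W)) - W.shape[0]

    W_dag   = np.array([[0.0, 1.5], [0.0, 0.0]])   # x1 -> x2 only
    W_cycle = np.array([[0.0, 1.5], [0.7, 0.0]])   # x1 <-> x2
    print(notears_h(W_dag))     # ~0.0
    print(notears_h(W_cycle))   # > 0

Because h is differentiable, the acyclicity requirement can be enforced with standard smooth constrained optimisation instead of a combinatorial search.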

Likelihood Function, Differentiable Score Function, and Constrained Neural Network

Now all the ingredients are ready!

Constrained Neural Network and Identifiability. Identifiability of a model means that once we fit it well, we are done: the best fit is the answer. This is not an easy property to obtain, so we often restrict ourselves to a specific class of functions for which identifiability of the model can be proven. Since this is the era of neural networks, we can design a constrained neural network that realises exactly such a function class. History backs this up: sines and cosines for the Fourier transform, ReLU for piecewise-linear approximation, and invertibility for normalizing flows.
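For instance (my own toy sketch in the spirit of RealNVP-style coupling layers, not taken from any specific causal-discovery model), an affine coupling layer is invertible by construction no matter what the inner functions s and t are; constraining the architecture buys us the property we need:

    import numpy as np

    def coupling_forward(x, s, t):
        # Split x into halves (x1, x2); transform only x2, conditioned on x1.
        x1, x2 = np.split(x, 2, axis=-1)
        y2 = x2 * np.exp(s(x1)) + t(x1)
        return np.concatenate([x1, y2], axis=-1)

    def coupling_inverse(y, s, t):
        y1, y2 = np.split(y, 2, axis=-1)
        x2 = (y2 - t(y1)) * np.exp(-s(y1))
        return np.concatenate([y1, x2], axis=-1)

    # Even with arbitrary nonlinear s and t, the layer inverts exactly.
    s = lambda z: np.tanh(z)
    t = lambda z: z ** 2
    x = np.random.randn(4)
    assert np.allclose(coupling_inverse(coupling_forward(x, s, t), s, t), x)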

Differentiable Likelihood Function. With identifiability guaranteed, we can simply enjoy fitting the model and automatically obtain the best one. A differentiable likelihood function means neural networks (and their optimisers) are waiting for you: design your favourite flavour of likelihood function, fit it to the data, and we are done. But remember the two keywords: differentiability and identifiability.
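A tiny illustration of that recipe (a hand-written Gaussian negative log-likelihood for a toy linear model, fitted with autograd; PyTorch is assumed here, and nothing below is specific to any particular causal discovery paper):

    import torch

    torch.manual_seed(0)
    x = torch.randn(500, 1)
    y = 2.0 * x + 0.1 * torch.randn(500, 1)        # toy data: y = 2x + noise

    w = torch.zeros(1, requires_grad=True)
    log_sigma = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, log_sigma], lr=0.05)

    for _ in range(300):
        opt.zero_grad()
        mu = w * x
        # Negative Gaussian log-likelihood, written out so the whole
        # objective is differentiable in (w, log_sigma).
        nll = (log_sigma + 0.5 * ((y - mu) / log_sigma.exp()) ** 2).mean()
        nll.backward()
        opt.step()

    print(w.item(), log_sigma.exp().item())        # w should end up near 2.0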

Best,

Ruibo


2020.04.15

Tuesday, 14 January 2020

Notes of "DAGs with NO TEARS: Continuous Optimization for Structure Learning"

Notes of "DAGs with NO TEARS: Continuous Optimization for Structure Learning"


The paper contributes to score-based causal discovery methods, which, given a score function, are combinatorially constrained optimization problems. Luckily, the combinatorial constraint, i.e., acyclicity, can be turned into a continuous one: in the linear case, a smooth function of the weighted adjacency matrix captures acyclicity exactly, so the whole problem becomes a continuous optimization that standard numerical algorithms can solve.
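Below is a toy end-to-end version of that recipe for the linear case: a least-squares score plus a penalty on the acyclicity term h(W) = tr(exp(W ∘ W)) − d, handed to a generic numerical optimiser. This is a deliberate simplification of the paper's augmented-Lagrangian scheme (fixed penalty weight, numerical gradients), not a reproduction of it.

    import numpy as np
    from scipy.linalg import expm
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, d = 1000, 3

    # Ground-truth chain x0 -> x1 -> x2 with unit-variance Gaussian noise.
    X = np.zeros((n, d))
    X[:, 0] = rng.normal(size=n)
    X[:, 1] = 1.5 * X[:, 0] + rng.normal(size=n)
    X[:, 2] = -0.8 * X[:, 1] + rng.normal(size=n)

    def h(W):                                   # acyclicity term from the paper
        return np.trace(expm(W * W)) - d

    def objective(w_flat, lam=10.0):
        W = w_flat.reshape(d, d)
        fit = 0.5 / n * np.linalg.norm(X - X @ W) ** 2   # least-squares score
        return fit + lam * h(W) ** 2                     # soft acyclicity penalty

    res = minimize(objective, np.zeros(d * d), method="L-BFGS-B")
    W_est = res.x.reshape(d, d)
    W_est[np.abs(W_est) < 0.3] = 0.0            # threshold small weights
    print(np.round(W_est, 2))                   # with luck, roughly the chain 0 -> 1 -> 2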

  • In linear cases, this is an optimization problem over a matrix that represents the graph structure. In (post-)nonlinear cases, these advantages may disappear.
  • What is the philosophy of score-based methods? Why can an edge found this way be interpreted as a causal relation? Can we remove the acyclicity constraint, and what would the result then mean?
  • A score function should be coherent with the optimization step. In other words, the score of a graph structure should tell us something about the correct structure we are looking for. This information is built into the score function but is not being used well; can we exploit it to guide the structure search? The gradient is one useful guide, but what other kinds of guide could there be?
  • What could be the relation between AutoML and score-based methods?

Saturday, 14 September 2019

Fairness in ICML 2019: “Optimal Decision Making under Strategic Behaviour”

Imagine that one day a huge number of decisions are made by machines. What would bother you?

Fairness in ICML 2019: “Learning Optimal Fair Policies”

I categorize fairness in machine learning into two forms: 1) static and 2) dynamic. Static fairness is about prediction and classification, while dynamic fairness is more about decision-making policies and the feedback/impact of those decisions. The difference is whether the algorithm considers the impact or feedback of its decisions.

Confounding in Causality

Recently, I started reading papers about the confounding problem in causal inference. In general, it is impossible to remove the influence of confounding when only a treatment and an outcome are observed. But that is the beauty of research: let's see how far we can go and where the state-of-the-art is now.
I will start with two papers: 
  1. “On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives”
  2. “Causal regularization”