LiNGAM
Observing that linear Structural Equation Models (SEMs) and Independent Component Analysis (ICA) share essentially the same problem formulation, LiNGAM [1] recovers linear causal relations under non-Gaussian noise.
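To make this concrete, here is a minimal two-variable sketch of the idea (not the actual LiNGAM algorithm, which adds a proper permutation search and edge pruning): generate data from a linear SEM with uniform noise, let off-the-shelf ICA estimate the unmixing matrix, and read the causal coefficient back out of it. The edge weight 2.0 and the uniform noise are made-up choices for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Ground-truth linear SEM with non-Gaussian (uniform) noise: x1 -> x2 with weight 2
n = 5000
e1 = rng.uniform(-1, 1, n)
e2 = rng.uniform(-1, 1, n)
x1 = e1
x2 = 2.0 * x1 + e2
X = np.column_stack([x1, x2])

# ICA's unmixing matrix approximates P D (I - B): the SEM matrix up to
# row permutation (P) and row scaling (D)
ica = FastICA(n_components=2, random_state=0)
ica.fit(X)
W = ica.components_

# Undo the permutation (pick the one with a non-zero diagonal) and the scaling
perm = max([(0, 1), (1, 0)], key=lambda p: abs(W[p[0], 0] * W[p[1], 1]))
W_hat = W[list(perm), :]
W_hat = W_hat / np.diag(W_hat)[:, None]
B_hat = np.eye(2) - W_hat
print(np.round(B_hat, 2))  # close to [[0, 0], [2, 0]]: x1 causes x2
```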
Identification Problem. The LiNGAM series introduces a point I find very interesting: the identification problem. In machine learning, people like the Gaussian distribution, yet LiNGAM deliberately assumes non-Gaussian noise. The study of such identification problems can be traced back to PCA and latent variable models, from which we have long known that a Gaussian latent variable model is rotationally invariant and therefore not identifiable.
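A quick numerical illustration of that rotational invariance, assuming an isotropic Gaussian latent: rotating it by any orthogonal matrix leaves the distribution, and hence anything we could estimate from data, unchanged, so the mixing cannot be pinned down.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((100000, 2))                # isotropic Gaussian latent
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # arbitrary rotation
z_rot = z @ R.T

# Both have (empirically) identity covariance: the rotation is invisible,
# so a Gaussian latent/noise model cannot identify the mixing matrix.
print(np.cov(z, rowvar=False).round(2))
print(np.cov(z_rot, rowvar=False).round(2))
```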
Cost, Likelihood, and Optimisation. Another significant difference is that there is no conditional independence test and no faithfulness assumption (which is hardly testable). ICA is carried out by optimising some cost function, and this brings in a very important concept: optimisation. We hope to achieve causal discovery by "automatic" optimisation rather than by searching over an unknown space, so we no longer need to hand-craft algorithmic procedures to keep the complexity down. Instead, we let optimisation do the searching for us, guided by some principle, and that principle is almost always the gradient of a cost or likelihood function. Sounds familiar, right? It is the same idea you use when training a neural network.
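As a sketch of what "optimise a cost by its gradient" looks like inside ICA, here is a crude projected gradient ascent on a kurtosis contrast, assuming centred, whitened data; real FastICA uses a fixed-point update instead, so treat this as illustration only.

```python
import numpy as np

def ica_direction(X, steps=300, lr=0.1, seed=0):
    """Crude projected gradient ascent on |kurtosis(w^T x)| for one ICA direction.
    Assumes X is centred and whitened; a sketch, not the real FastICA update."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(steps):
        s = X @ w
        kurt = np.mean(s ** 4) - 3.0                  # excess kurtosis (unit variance)
        grad = 4.0 * (X * (s ** 3)[:, None]).mean(axis=0)
        w += lr * np.sign(kurt) * grad                # ascend |kurtosis|
        w /= np.linalg.norm(w)                        # project back onto the unit sphere
    return w

# Toy usage: two independent uniform sources mixed by a rotation (already white)
rng = np.random.default_rng(1)
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(20000, 2))
theta = 0.6
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
X = S @ R.T
print(np.round(ica_direction(X), 2))   # ~ a column of R (up to sign): one unmixing direction
```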
Score-based methods and No Tears
Score-based Causal Discovery. Using a DAG to represent a causal graph already existed in the PC and FCI era, but the wish to optimise over the whole graph was not yet fulfilled. Machine learning loves optimization and hates searching candidates one by one, and we have tons of tools for optimization. GES goes some way towards satisfying this wish: it proposes a score function that evaluates the structure of a DAG. Wow, amazing, we can optimise the score to get the best-fitting DAG. Here, score = f(DAG). Pretty cool, right? But in principle we would still need to enumerate all possible DAGs to pick the best one, which is slow. Under some assumptions, GES can be fast thanks to a clever forward-backward search that does not need to visit every possible DAG. Later work proposes new, information-theoretically flavoured score functions, but these feel like small refinements compared with the big idea underlying GES and score-based causal discovery.
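As a sketch of what such a score can look like, here is a decomposable BIC-style score for a DAG under a linear-Gaussian model, assuming centred data; the exact GES score differs in details, but the "score = f(DAG)" shape is the same.

```python
import numpy as np

def bic_score(X, parents):
    """Toy decomposable BIC-style score for a DAG under a linear-Gaussian model.
    `parents` maps each column index to the list of its parent column indices.
    Assumes the columns of X are centred. A sketch of 'score = f(DAG)',
    not the exact score used by GES."""
    n, d = X.shape
    score = 0.0
    for j in range(d):
        pa = parents.get(j, [])
        y = X[:, j]
        if pa:
            P = X[:, pa]
            beta, *_ = np.linalg.lstsq(P, y, rcond=None)
            resid = y - P @ beta
        else:
            resid = y
        sigma2 = resid.var() + 1e-12
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score += loglik - 0.5 * (len(pa) + 1) * np.log(n)   # BIC penalty per node
    return score

# e.g. for three centred columns and the DAG 0 -> 1, 0 -> 2:
#   bic_score(X, {0: [], 1: [0], 2: [0]})
```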
No Tears. Don't cry? Yes. Enumerating all possible DAGs is painful, but don't cry: a smarter tool is waiting for us. Why, when the GES series already uses optimization, do we still feel unhappy? Because it is combinatorial optimization, and that is not differentiable! Seriously, this is nothing like a neural network, which significantly limits its applicability. The fact is that we lack good tools for combinatorial optimization, so the GES series remains a sub-optimal solution. Luckily, the DAG score can be turned into a differentiable function, and then "neural network"! This is what NOTEARS has done. Marvellous! Optimizing a differentiable function: there can't be better news for machine learning.
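The key trick is the NOTEARS acyclicity function h(W) = tr(exp(W * W)) - d, which is zero exactly when the weighted adjacency matrix W describes a DAG and is smooth in W, so it can sit inside a continuous, gradient-based optimisation. A minimal sketch:

```python
import numpy as np
from scipy.linalg import expm

def notears_acyclicity(W):
    """NOTEARS acyclicity function h(W) = tr(exp(W * W)) - d.
    h(W) = 0 iff W is the weighted adjacency matrix of a DAG, and h is
    differentiable (its gradient is expm(W * W).T * 2W), so it can be used
    as a smooth constraint instead of a combinatorial search."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d

# A 2-cycle violates the constraint; a lower-triangular (acyclic) W satisfies it.
W_cyclic = np.array([[0.0, 0.8], [0.5, 0.0]])
W_dag    = np.array([[0.0, 0.0], [0.5, 0.0]])
print(notears_acyclicity(W_cyclic))   # > 0
print(notears_acyclicity(W_dag))      # ~ 0
```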
Likelihood Function, Differentiable Score Function, and Constrained Neural Network
Now all the ingredients are ready!
Constrained Neural Network and Identifiability. The identifiability of a model is what lets us simply fit it as well as possible and stop there, and that is not an easy thing to guarantee. Therefore, we may restrict ourselves to a specific class of functions for which identifiability of the model can be proved. Since this is the era of neural networks, we can design a constrained neural network that realises exactly such a function class. History backs this pattern up: sines and cosines for the Fourier transform, ReLU for piecewise-linear approximation, and invertibility for normalizing flows.
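As a toy illustration of "constrain the architecture so the function class has the property you need", here is a minimal affine coupling layer in the normalizing-flow style: whatever network produces the shift and scale, the layer is invertible by construction. The shift_scale stand-in below is a hypothetical placeholder for that network.

```python
import numpy as np

def coupling_forward(x, shift_scale):
    """Minimal affine coupling layer: invertible by construction.
    Splits x into halves and transforms the second half conditioned on the first."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = shift_scale(x1)              # any network; invertibility does not depend on it
    return np.concatenate([x1, x2 * np.exp(s) + t], axis=-1)

def coupling_inverse(y, shift_scale):
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = shift_scale(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=-1)

# Tiny stand-in for a network producing (log-scale, shift) from the first half.
shift_scale = lambda h: (np.tanh(h), 2.0 * h)

x = np.array([[0.3, -1.2, 0.7, 0.1]])
y = coupling_forward(x, shift_scale)
print(np.allclose(coupling_inverse(y, shift_scale), x))   # True: invertible by design
```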
Differentiable Likelihood Function. With identifiability guaranteed, we can enjoy fitting the model and automatically obtain the best one. A differentiable likelihood function means neural networks are waiting for you: design your favourite flavour of likelihood, fit it to the data, and we are done. But remember the two keywords: differentiability and identifiability.
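A minimal sketch of this recipe, assuming a toy linear model with unit-variance Gaussian noise: write down the log-likelihood, take its gradient, and let gradient ascent do the fitting.

```python
import numpy as np

# Toy model x2 = b * x1 + e, fitted by gradient ascent on the Gaussian log-likelihood.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1000)
x2 = 1.5 * x1 + 0.5 * rng.standard_normal(1000)

b, lr = 0.0, 0.1
for _ in range(100):
    resid = x2 - b * x1
    grad = (resid * x1).mean()    # d/db of the (unit-variance) Gaussian log-likelihood
    b += lr * grad
print(round(b, 2))                # close to the true coefficient 1.5
```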
Best,
Ruibo
2020.04.15