Efficient Neural Network Approaches for Conditional Optimal Transport: Discussion and Reference

Written by bayesianinference | Published 2024/04/15
Tech Story Tags: efficient-neural-network | neural-network-approaches | conditional-optimal-transport | static-cot | dynamic-cot | cot-maps | cot-problems | pcp-map-models

TL;DR: This paper presents two neural network approaches that approximate the solutions of the static and dynamic conditional optimal transport problems, respectively.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zheyu Oliver Wang, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA and [email protected];

(2) Ricardo Baptista, Computing + Mathematical Sciences, California Institute of Technology, Pasadena, CA and [email protected];

(3) Youssef Marzouk, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA and [email protected];

(4) Lars Ruthotto, Department of Mathematics, Emory University, Atlanta, GA and [email protected];

(5) Deepanshu Verma, Department of Mathematics, Emory University, Atlanta, GA and [email protected].

7. Discussion.

We present two measure transport approaches, PCP-Map and COT-Flow, that learn conditional distributions by approximately solving the static and dynamic conditional optimal transport (COT) problems, respectively. Specifically, penalizing transport costs in the learning problem yields a unique optimal transport map, known as the conditional Brenier map, between the target conditional distribution and the reference. Furthermore, for PCP-Map, minimizing the quadratic transport cost motivates us to exploit the structure of the Brenier map by constraining the search to monotone maps given as gradients of convex potentials. Similarly, for COT-Flow this choice leads to a conservative vector field, which we enforce by design.
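
To make the map structure concrete, below is a minimal, hedged sketch of a toy potential that is convex in the reference variable z for any conditioning input y, with the conditional map obtained as its gradient. The class and function names are illustrative and do not reproduce the paper's actual PCP-Map architecture.

```python
# Minimal sketch (not the paper's actual PCP-Map architecture): a toy potential
# G(z; y) that is convex in z for any conditioning input y, whose gradient with
# respect to z acts as a monotone conditional transport map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPartiallyConvexPotential(nn.Module):
    def __init__(self, dim_z, dim_y, width=64):
        super().__init__()
        self.lin_z = nn.Linear(dim_z, width)   # acts on the reference variable z
        self.lin_y = nn.Linear(dim_y, width)   # unconstrained weights on the conditioning input y
        self.out = nn.Linear(width, 1)         # output weights re-parametrized to be non-negative

    def forward(self, z, y):
        # Affine in z followed by a convex, non-decreasing activation: convex in z.
        h = F.softplus(self.lin_z(z) + self.lin_y(y))
        # A non-negative combination of convex functions stays convex in z.
        g = F.linear(h, F.softplus(self.out.weight), self.out.bias)
        # A quadratic term makes the potential strongly convex in z.
        return g + 0.5 * (z ** 2).sum(dim=1, keepdim=True)

def conditional_map(potential, z, y):
    # The monotone conditional map is the gradient of the potential with respect to z.
    z = z.clone().requires_grad_(True)
    g = potential(z, y).sum()
    return torch.autograd.grad(g, z, create_graph=True)[0]
```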

Our comparison to the SMC-ABC approach for the stochastic Lotka-Volterra problem shows common trade-offs when selecting conditional sampling approaches. Advantages of the ABC approach include its strong theoretical guarantees and well-known guidelines for choosing the involved hyperparameters (annealing, burn-in, number of samples to skip to reduce correlation, etc.). The disadvantages are that ABC typically requires a large number of likelihood evaluations to produce (approximately) i.i.d. samples and low-variance estimators in high-dimensional parameter spaces; the computation is difficult to parallelize in the sequential Monte Carlo setting; and the sampling process is not amortized over the conditioning variable y∗, i.e., it needs to be recomputed whenever y∗ changes.
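
The last point, amortization over y∗, is the key practical difference: once a conditional transport map is trained, drawing posterior samples for a new observation only requires pushing reference samples through the map. A minimal sketch, where trained_map(z, y) is a placeholder rather than the paper's interface:

```python
# Illustrative only: with a trained conditional map `trained_map(z, y)` (a placeholder),
# posterior sampling for a new observation y_star is amortized: no simulation,
# MCMC, or ABC re-runs are needed when y_star changes.
import torch

def amortized_posterior_samples(trained_map, y_star, num_samples, dim_z):
    z = torch.randn(num_samples, dim_z)      # reference (standard Gaussian) samples
    y = y_star.repeat(num_samples, 1)        # reuse the same conditioning vector for every sample
    return trained_map(z, y)                 # push-forward yields (approximate) posterior samples
```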

Comparisons to the flow-based NPE method for the high-dimensional 1D shallow water equations problem illustrate the superior numerical accuracy achieved by our approaches. In terms of numerical efficiency, the PCP-Map approach, while providing a working computational scheme for the static COT problem, converges significantly faster than the amortized CP-Flow approach.

Learning posterior distributions using our techniques or similar measure transport approaches is attractive for real-world applications where samples from the joint distribution are available (or can be generated efficiently), but evaluating the prior density or the likelihood model is intractable. Common examples where a non-intrusive approach for conditional sampling can be fruitful include inverse problems where the predictive model involves stochastic differential equations (as in subsection 6.2) or legacy code, and imaging problems where only prior samples are available.
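
In that likelihood-free setting, training only needs a set of joint samples. Below is a hedged sketch of generating such a training set, where prior_sampler and stochastic_simulator are placeholders for the user's prior and (possibly black-box or legacy) forward model:

```python
# Hedged sketch of the likelihood-free training setup described above: draw
# parameters from the prior, run them through a stochastic simulator, and keep
# only the joint samples (x, y). `prior_sampler` and `stochastic_simulator`
# are placeholders, not part of the paper's code.
import torch

def build_joint_training_set(prior_sampler, stochastic_simulator, num_pairs):
    xs, ys = [], []
    for _ in range(num_pairs):
        x = prior_sampler()              # parameter sample from the prior
        y = stochastic_simulator(x)      # noisy observation; its density is never evaluated
        xs.append(x)
        ys.append(y)
    return torch.stack(xs), torch.stack(ys)   # joint samples used to train the transport map
```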

Given the empirical nature of our study, we paid particular attention to the setup and reproducibility of our numerical experiments. To show the robustness of our approaches to hyperparameters and to provide guidelines for hyperparameter selection in future experiments, we report the results of a simple two-step heuristic that randomly samples hyperparameters and identifies the most promising configurations after a small number of training steps. We stress that the same search space of hyperparameters is used across all numerical experiments.
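
As a rough illustration of that two-step heuristic, the sketch below samples configurations at random, runs short pilot trainings, and retains the best candidates. The search space, pilot budget, and number of retained configurations are placeholders, not the paper's exact values, and train_fn(cfg, num_steps) is assumed to return a validation loss.

```python
# Rough illustration of the two-step heuristic described above; all numeric
# choices and the `train_fn` interface are assumptions for this sketch.
import random

def two_step_hyperparameter_search(train_fn, num_configs=50, pilot_steps=200, keep=3):
    # Step 1: sample random configurations from a shared search space and
    # train each for only a small number of steps.
    search_space = {
        "feature_width": [64, 128, 256, 512],
        "context_width": [64, 128, 256],
        "depth":         [2, 3, 4, 6],
        "lr":            [1e-4, 5e-4, 1e-3],
    }
    configs = [{k: random.choice(v) for k, v in search_space.items()}
               for _ in range(num_configs)]
    pilot = [(cfg, train_fn(cfg, num_steps=pilot_steps)) for cfg in configs]

    # Step 2: keep only the most promising configurations (lowest pilot loss)
    # and train those to convergence afterwards.
    pilot.sort(key=lambda item: item[1])
    return [cfg for cfg, _ in pilot[:keep]]
```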

The results for the shallow water dataset (subsection 6.3) indicate that both methods can learn high-dimensional COT maps. Here, the number of effective parameters in the dataset was n = 14, and the number of effective measurements was m = 3500. Particularly worth noting is that PCP-Map, on average, converges in around 715 seconds on this challenging high-dimensional problem.

Since both approaches perform similarly in our numerical experiments, we want to comment on some distinguishing factors. One advantage of the PCP-Map approach is that it only depends on three hyperparameters (feature width, context width, and network depth), and we observed consistent performance for most choices. This feature is particularly attractive when experimenting with new problems. The limitation is that a carefully designed network architecture needs to be imposed to guarantee partial convexity of the potential. On the other hand, the value function (i.e., the velocity field) in COT-Flow can be designed almost arbitrarily. Thus, the latter approach may be beneficial when new data types and their invariances need to be modeled (e.g., permutation invariances or symmetries) that might conflict with the network architecture required by the direct transport map.

Both approaches also differ in terms of their numerical implementation. Training PCP-Map via backpropagation is relatively straightforward, but sampling requires solving a convex program, which can be more expensive than integrating the ODE defined by the COT-Flow approach, especially when that model is trained well and the velocity is constant along trajectories. Training the COT-Flow model, however, is more involved due to the ODE constraints.
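
To make this practical difference concrete, here is a hedged sketch of the two sampling routes: solving a convex program to invert a gradient map (PCP-Map-style) versus integrating a learned velocity field (COT-Flow-style). The networks and solver settings are placeholders, not the paper's implementation.

```python
# Hedged sketch of the two sampling routes contrasted above. `potential`
# (convex in its first argument) and `velocity` stand in for trained networks;
# solver settings are illustrative.
import torch

def sample_pcp_map_style(potential, y, z, steps=100):
    # Inverting the monotone map x -> grad_x G(x; y) at a reference sample z
    # amounts to the convex program  min_x  G(x; y) - <z, x>.
    x = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.LBFGS([x], max_iter=steps, line_search_fn="strong_wolfe")

    def closure():
        opt.zero_grad()
        loss = (potential(x, y) - (z * x).sum(dim=1, keepdim=True)).sum()
        loss.backward()
        return loss

    opt.step(closure)
    return x.detach()

def sample_cot_flow_style(velocity, y, z, num_steps=32):
    # Integrate the learned velocity field from t=0 to t=1 (forward Euler here);
    # few steps suffice when the velocity is nearly constant along trajectories.
    x, dt = z.clone(), 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity(x, y, t)
    return x
```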

Although our numerical evidence supports the effectiveness of our methods, important remaining limitations of our approaches include the absence of theoretical guarantees for sample efficiency and optimization. In particular, analyses of the statistical complexity and approximation properties of COT maps learned with PCP-Map or COT-Flow in a conditional sampling context would be beneficial. We also point out that it can be difficult to quantify the accuracy of the produced samples without a benchmark method.


