# Differences

This shows you the differences between two versions of the page.

people:xiaotian_yu [2017/08/31 17:47] xtyu |
people:xiaotian_yu [2017/09/03 22:20] (current) xtyu |
||
---|---|---|---|

Line 33: | Line 33: | ||

- | ===== References on Machine Learning and Optimization ===== | + | ===== Books on Optimization, Online Learning and Deep Learning ===== |

==== Deterministic Optimization ==== | ==== Deterministic Optimization ==== | ||

Line 60: | Line 60: | ||

3. Hazan, Elad. Introduction to online convex optimization. (2016)\\ | 3. Hazan, Elad. Introduction to online convex optimization. (2016)\\ | ||

http://ocobook.cs.princeton.edu/OCObook.pdf | http://ocobook.cs.princeton.edu/OCObook.pdf | ||

+ | |||

+ | |||

+ | ==== Learning Theory and Bandits ==== | ||

+ | 1. Cesa-Bianchi, Nicolo, and Gábor Lugosi. Prediction, learning, and games. (2006)\\ | ||

+ | http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf | ||

+ | |||

+ | 2. Bubeck, Sébastien, and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. (2012)\\ | ||

+ | https://arxiv.org/pdf/1204.5721.pdf | ||

+ | |||

+ | 3. Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. (2014)\\ | ||

+ | http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf | ||

+ | |||

+ | |||

+ | ==== Deep Learning ==== | ||

+ | |||

+ | |||

+ | ===== Xixian's Ideas ===== | ||

+ | |||

+ | ==== Notes on the Conference Papers ==== | ||

+ | |||

+ | 1. ICML: 1629 submissions, 434 accepted, 9 parallel tracks \\ | ||

+ | |||

+ | 2. Ranking by number of accepted papers: neural networks + deep learning, optimization, reinforcement learning, supervised learning, big data + large-scale learning, unsupervised learning, generative models, online learning\\ | ||

+ | |||

+ | 3. Best paper: Understanding Black-box Predictions via Influence Functions\\ | ||

+ | |||

+ | 4. Test of Time Award: Combining Online and Offline Knowledge in UCT \\ | ||

+ | |||

+ | ==== Matrix and Tensor ==== | ||

+ | |||

+ | 1. Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU \\ | ||

+ | (new techniques for quickly computing eigenvectors and reducing dimension; they can be used in a range of work involving random projection, such as clustering, bandits, regression, SVM, and community detection with dimension reduction) | ||

+ | |||

+ | 2. Efficient approximate sampling for projection DPPs \\ | ||

+ | (DPPs can be used to improve kernel learning, as in "Fast DPP Sampling for Nystrom with Application to Kernel Methods", and recommender systems, as in "Bayesian Low-Rank Determinantal Point Processes") | ||

+ | |||

+ | 3. An Efficient, Sparsity-preserving, Online Algorithm for Low-Rank Approximation \\ | ||

+ | (new results on LU decomposition; can be used to quickly approximate the SVD and in much other work on matrix approximation) | ||

+ | |||

+ | 4. Approximate Newton Methods and Their Local Convergence \\ | ||

+ | (this paper only improves the theoretical results for fast Newton methods via sampling, but projections seem to yield better results) | ||

+ | |||

+ | 5. Tensor Decomposition via Simultaneous Power Iteration \\ | ||

+ | (power iteration is still time-consuming and cannot take advantage of data sparsity) | ||

+ | |||

+ | 6. Efficient Distributed Learning with sparsity \\ | ||

+ | (matrix approximation with guaranteed sparsity can be used to improve the tradeoff) | ||

+ | |||

+ | |||

+ | 7. Kernelized Support Tensor Machines \\ | ||

+ | (this paper resembles the tensor machine with random features for polynomial kernels) | ||

+ | |||

+ | 8. Distributed Mean Estimation with Limited Communication \\ | ||

+ | (the techniques in my paper can be used to improve this task) | ||

+ | |||

+ | 9. Sketched Ridge Regression: Optimization and Statistical Perspectives \\ | ||

+ | (the model averaging involved is interesting, and weighted averaging may work better) | ||

+ | |||

+ | 10. Beyond Filters: Compact Feature Map for Portable Deep Model \\ | ||

+ | (this paper uses fast dimension reduction; what about hashing techniques?) | ||

+ | |||

+ | 11. ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning\\ (this paper reduces the number of bits to cut the communication burden while keeping the estimator correct in expectation; combining it with randomized techniques looks promising) | ||

+ | |||

+ | 12. Spherical Structured Feature Maps for Kernel Approximation \\ | ||

+ | (still time consuming) | ||

+ | |||

+ | 13. Nystrom Method with Kernel K-means++ Samples as Landmarks\\ | ||

+ | |||

+ | 14. Joint Embedding Models for Textual and Social Analysis \\ | ||

+ | |||

+ | 15. Effective Sketching methods for value function approximation \\ | ||

+ | |||

+ | 16. Triply stochastic gradients on multiple kernel learning \\ | ||

+ | |||

+ | 17. DPP for mini-batch diversification \\ | ||

+ | |||
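The low-precision idea noted for ZipML (item 11 above) can be illustrated with stochastic rounding: a value is rounded up or down to a neighbouring grid point with probabilities chosen so that the rounded value equals the original in expectation. The following is only a minimal sketch under that assumption, not the paper's algorithm; the function name `stochastic_quantize` and its parameters are hypothetical.

```python
import math
import random


def stochastic_quantize(x, levels, lo=0.0, hi=1.0, rng=random):
    """Round x onto an evenly spaced grid of `levels` points in [lo, hi].

    The rounding direction is random, with probabilities chosen so that
    the expected output equals the (clipped) input.
    """
    if levels < 2:
        raise ValueError("need at least two quantization levels")
    x = min(max(x, lo), hi)              # clip into the representable range
    step = (hi - lo) / (levels - 1)      # spacing between grid points
    pos = (x - lo) / step                # position measured in grid units
    lower = math.floor(pos)
    frac = pos - lower                   # distance to the lower grid point
    # Round up with probability `frac`, down otherwise.
    idx = lower + (1 if rng.random() < frac else 0)
    idx = min(idx, levels - 1)           # guard against float round-off at hi
    return lo + idx * step


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_quantize(0.3, 4, rng=rng) for _ in range(20000)]
    # The sample mean should be close to 0.3 even though only 4 levels exist.
    print(sum(samples) / len(samples))
```

With 4 levels each value needs only 2 bits, and averaging many quantized copies recovers the original value, which is why such schemes can reduce communication without biasing the estimator.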

+ | ==== Distributed ==== | ||

+ | 1. Distributed Batch Gaussian Process Optimization \\ | ||

+ | (the key step is to approximate the kernel matrix by many small blocks; how about using more advanced approximation with or without regularization? Moreover, it is still difficult to extend the techniques of this paper to Laplace rather than Gaussian)\\ | ||

+ | |||

+ | 2. Communication efficient distributed primal-dual algorithms for saddle point problem | ||

+ | |||

+ | ==== Optimization ==== | ||

+ | 1. The bounds of ‘Model-Independent Online Learning for Influence Maximization’ and | ||

+ | ‘Online Learning to Rank in Stochastic Click Models’ are not tight enough. \\ | ||

+ | |||

+ | 2. High-dimensional variance-reduced stochastic gradient expectation-maximization algorithms \\ | ||

+ | (this new paper = old optimization techniques solving other old problems) | ||

+ | |||

+ | 3. Sub-sampled Cubic Regularization for Non-convex Optimization \\ | ||

+ | (sampling is expected to fail in theory for non-convex optimization, but this paper still makes it work for some specific non-convex problems. Can we relax the constraint a bit, use advanced sampling, or get tighter results? Refer to related work in convex optimization) | ||

+ | |||

+ | 4. Adaptive sampling probabilities for non-smooth optimization\\ | ||

+ | |||

+ | 5. Second-Order Kernel Online Convex Optimization with Adaptive Sketching \\ | ||

+ | |||

+ | 6. Adaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIP \\(RIP may not be required to get the best sum of bias error and variance error) | ||

+ | |||

+ | 7. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter \\ | ||

+ | (this paper focuses on the eigenvalues of the Hessian matrix, and only derives the desired results when the eigenvalues are larger than some values. So it would be significant to relax this constraint.) | ||

+ | |||

+ | 8. Coupling adaptive batch sizes with learning rates \\ | ||

+ | (interesting; can it easily be extended to other learning problems?) | ||

+ | |||

+ | |||

+ | ==== Bayes & Gaussian Process ==== | ||

+ | |||

+ | 1. Stochastic Gradient Descent as Approximate Bayesian Inference | ||

+ | |||

+ | 2. Importance sampled stochastic optimization for variational inference \\ | ||

+ | (these two papers use optimization to approximate the posterior distribution, a recent research direction) | ||

+ | |||

+ | ==== Deep Learning ==== | ||

+ | 1. Deep Transfer Learning with Joint Adaptation Networks \\ | ||

+ | (the key step, Maximum Mean Discrepancy for distribution matching, has a large variance, which we could reduce) | ||

+ | |||

+ | 2. On Orthogonality and Learning Recurrent Networks with Long Term Dependencies \\ | ||

+ | (the key step, enforcing orthogonality, uses SVD for batch learning, which is time-consuming and hurts generalization ability) | ||

+ | |||

+ | 3. Random Feature Expansions for Deep Gaussian Processes \\ | ||

+ | (at least many other advanced Random Feature Expansions can be used in this scheme) | ||

+ |
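As a reference point for the random feature expansions mentioned in item 3 above, the sketch below shows the basic random Fourier feature construction for an RBF kernel, which the more advanced expansions refine. It is an illustration only and is not taken from the paper; the names `make_rff` and `rbf_kernel` and the parameter `gamma` are assumptions introduced here.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_rff(dim, n_features, gamma, rng):
    """Sample a random Fourier feature map approximating the RBF kernel.

    Frequencies are drawn from N(0, 2 * gamma * I) and phases uniformly
    from [0, 2*pi), so the inner product of two feature vectors matches
    the kernel value in expectation.
    """
    sigma = math.sqrt(2.0 * gamma)
    ws = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [
            scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(ws, bs)
        ]

    return feature_map


if __name__ == "__main__":
    rng = random.Random(0)
    phi = make_rff(dim=3, n_features=4000, gamma=0.5, rng=rng)
    x, y = [0.1, 0.2, 0.3], [0.4, -0.1, 0.2]
    approx = sum(a * b for a, b in zip(phi(x), phi(y)))
    # The two printed values should be close; the gap shrinks as n_features grows.
    print(approx, rbf_kernel(x, y, 0.5))
```

The approximation error decays roughly as one over the square root of the number of features, which is the cost that structured or otherwise improved expansions try to reduce.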