Proximal gradient method for huberized support vector machine

Xu, Yangyang; Akrotirianakis, Ioannis; Chakraborty, Amit

doi:10.1007/s10044-015-0485-z

Proximal gradient method for huberized support vector machine

Theoretical Advances
Published: 31 May 2015

Volume 19, pages 989–1005, (2016)
Cite this article

Pattern Analysis and Applications Aims and scope Submit manuscript

Yangyang Xu¹,
Ioannis Akrotirianakis² &
Amit Chakraborty²

603 Accesses
20 Citations
2 Altmetric
Explore all metrics

Abstract

The support vector machine (SVM) has been used in a wide variety of classification problems. The original SVM uses the hinge loss function, which is non-differentiable and makes the problem difficult to solve in particular for regularized SVMs, such as with $\ell _1$-regularization. This paper considers the Huberized SVM (HSVM), which uses a differentiable approximation of the hinge loss function. We first explore the use of the proximal gradient (PG) method to solving binary-class HSVM (B-HSVM) and then generalize it to multi-class HSVM (M-HSVM). Under strong convexity assumptions, we show that our algorithm converges linearly. In addition, we give a finite convergence result about the support of the solution, based on which we further accelerate the algorithm by a two-stage method. We present extensive numerical experiments on both synthetic and real datasets which demonstrate the superiority of our methods over some state-of-the-art methods for both binary- and multi-class SVMs.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

€32.70 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Bulgaria)

Instant access to the full article PDF.

Institutional subscriptions

Alternating Direction Method of Multipliers for Regularized Multiclass Support Vector Machines

Sparse Proximal Support Vector Machine with a Specialized Interior-Point Method

Article 26 February 2015

A proximal quadratic surface support vector machine for semi-supervised binary classification

Article 09 August 2017

Notes

Strictly speaking, we add the term $\frac{\lambda _3}{2}b^2$ in (2). This modification is similar to that used in [22], and the extra term usually does not affect the prediction but makes PG method converge faster.
Without strong convexity of $\xi _2$, we only have sublinear convergence as shown in [1].
Our algorithm and ADMM are both implemented in MATLAB, while the code of GCD is written in R. To be fair, we re-wrote the code of GCD and implemented it in MATLAB.
The paper [30] uses a path-following method to solve (3). However, its code is not publicly available.
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.
http://www.cs.cmu.edu/~tom/fmri.html.
The Wilcoxon signed-ranks test is in general better than the paired $t$-test as demonstrated in [9].
The code of DDAG is available from http://theoval.cmp.uea.ac.uk/svm/toolbox/.
Our reported accuracies are lower than those in [20] because we used fewer training samples.

References

Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202
Article MathSciNet MATH Google Scholar
Bolte J, Daniilidis A, Lewis A (2007) The lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J Optim 17(4):1205–1223
Article MathSciNet MATH Google Scholar
Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Jackel LD, LeCun Y, Muller UA, Sackinger E, Simard P et al (1994) Comparison of classifier methods: a case study in handwritten digit recognition. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol 2, pp 77–82. IEEE
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Book MATH Google Scholar
Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Nat Acad Sci 97(1):262
Article Google Scholar
Chen Zhen-Yu, Fan Zhi-Ping, Sun Minghe (2012) A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. Eur J Oper Res 223(2):461–472
Article MathSciNet MATH Google Scholar
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
MATH Google Scholar
Czarnecki WM, Tabor J (2014) Two ellipsoid support vector machines. Exp Syst Appl 41:8211–8224
Article Google Scholar
Demšar Janez (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
MathSciNet MATH Google Scholar
Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97(457):77–87
Article MathSciNet MATH Google Scholar
Fan Rong-En, Chang Kai-Wei, Hsieh Cho-Jui, Wang Xiang-Rui, Lin Chih-Jen (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9:1871–1874
MATH Google Scholar
Friedman Milton (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701
Article MATH Google Scholar
Friedman Milton (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
Article MathSciNet MATH Google Scholar
Galar M, Fernandez A, Barrenechea E, Herrera F (2015) Drcw-ovo: Distance-based relative competence weighting combination for one-vs-one strategy in multiclass problems. Pattern Recognit 48:28–42
Article Google Scholar
Galar Mikel, Fernández Alberto, Barrenechea Edurne, Bustince Humberto, Herrera Francisco (2011) An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognit 44(8):1761–1776
Article Google Scholar
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Article Google Scholar
Hale ET, Yin W, Zhang Y (2008) Fixed-point continuation for $l_1$-minimization: methodology and convergence. SIAM J Optim 19(3):1107–1130
Article MathSciNet MATH Google Scholar
Heisele B, Ho P, Poggio T (2001) Face recognition with support vector machines: Global versus component-based approach. In: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol 2, pp 688–694. IEEE
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 65–70
Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. Neural Netw IEEE Trans 13(2):415–425
Article Google Scholar
CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0 beta (2012). http://cvxr.com/cvx, Sept 2012
Keerthi SS, DeCoste D (2006) A modified finite newton method for fast solution of large scale linear svms. J Mach Learn Res 6(1):341
MathSciNet MATH Google Scholar
Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C et al (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7(6):673–679
Article Google Scholar
Kim KI, Jung K, Park SH, Kim HJ (2002) Support vector machines for texture classification. Pattern Anal Mach Intell IEEE Trans 24(11):1542–1550
Article Google Scholar
Knerr S, Personnaz L, Dreyfus G (1990) Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Neurocomputing, pp 41–50. Springer
Krawcczyk B, Wozniak M, Cyganek B (2014) Clustering-based ensembles for one-class classification. Inf Sci 264:182–195
Article MathSciNet MATH Google Scholar
Kurdyka K (1998) On gradients of functions definable in o-minimal structures. In: Annales de l’institut Fourier, vol 48, pp 769–784. Chartres: L’Institut, 1950
Lacoste-Julien S, Jaggi M, Schmidt M, Pletscher P (2012) Block-coordinate frank-wolfe optimization for structural svms. arXiv preprint arXiv:1207.4747
Lee Y, Lin Y, Wahba G (2004) Multicategory support vector machines. J Am Stat Assoc 99(465):67–81
Article MathSciNet MATH Google Scholar
Li J, Jia Y (2010) Huberized multiclass support vector machine for microarray classification. Acta Autom Sin 36(3):399–405
MathSciNet MATH Google Scholar
Li Qi, Salman Raied, Test Erik, Strack Robert, Kecman Vojislav (2013) Parallel multitask cross validation for support vector machine using gpu. J Parallel Distrib Comput 73(3):293–302
Article Google Scholar
Łojasiewicz S (1993) Sur la géométrie semi-et sous-analytique. Ann Inst Fourier (Grenoble) 43(5):1575–1595
Article MathSciNet MATH Google Scholar
Nesterov Y (2004) Introductory lectures on convex optimization vol 87. A basic course
Nesterov Y (2007) Gradient methods for minimizing composite objective function. CORE Discussion Papers
Ouyang H, He N, Tran L, Gray A (2013) Stochastic alternating direction method of multipliers. In: Proceedings of the 30th International Conference on Machine Learning, pp 80–88
Platt JC, Cristianini N, Shawe-Taylor J (2000) Large margin dags for multiclass classification. Adv Neural Inf Process Syst 12(3):547–553
Google Scholar
Qi Zhiquan, Tian Yingjie, Shi Yong (2013) Structural twin support vector machine for classification. Knowl-Based Syst 43:74–81
Article Google Scholar
Schmidt M, Roux NL, Bach F (2011) Convergence rates of inexact proximal-gradient methods for convex optimization. Arxiv preprint arXiv:1109.2415
Sturm JF (1999) Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim Methods Softw 11(1–4):625–653
Article MathSciNet MATH Google Scholar
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
MathSciNet MATH Google Scholar
Tsyurmasto P, Zabarankin M, Uryasev S (2014) Value-at-risk support vector machine: stability to outliers. J Comb Optim 1–15
Wang L, Shen X (2007) On ${L}_1$-norm multiclass support vector machines. J Am Stat Assoc 102(478):583–594
Article MATH Google Scholar
Wang L, Zhu J, Zou H (2006) The doubly regularized support vector machine. Stat Sin 16(2):589
MathSciNet MATH Google Scholar
Wang L, Zhu J, Zou H (2008) Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24(3):412–419
Article Google Scholar
Wilcoxon F (1945) Individual comparisons by ranking methods. Biom Bull pp 80–83
Xu Y, Yin W (2013) A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J Imaging Sci 6(3):1758–1789
Article MathSciNet MATH Google Scholar
Xu Y, Yin W (2014) A globally convergent algorithm for nonconvex optimization based on block coordinate update. Arxiv preprint arXiv:1410.1386
Yang Yi, Zou Hui (2013) An efficient algorithm for computing the hhsvm and its generalizations. J Comput Graph Stat 22(2):396–415
Article MathSciNet Google Scholar
Ye GB, Chen Y, Xie X (2011) Efficient variable selection in support vector machines via the alternating direction method of multipliers. In: Proceedings of the International conference on artificial intelligence and statistics
Zhang H, Liu Y, Wu Y, Zhu J (2008) Variable selection for the multicategory svm via adaptive sup-norm regularization. Electron J Stat 2:149–167
Article MathSciNet MATH Google Scholar
Zhang Yang, Meratnia Nirvana, Havinga Paul JM (2013) Distributed online outlier detection in wireless sensor networks using ellipsoidal support vector machine. Ad Hoc Netw 11(3):1062–1074
Article Google Scholar
Zou J, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B 67(1):301–320
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

The authors would like to thank three anonymous referees and the associate editor for their valuable comments and suggestions that improved the quality of this paper. The first author would also like to thank Professor Wotao Yin for his valuable discussion.

Author information

Authors and Affiliations

Department of Computational and Applied Mathematics, Rice University, Houston, TX, USA
Yangyang Xu
Siemens Corporate Research, Princeton, NJ, USA
Ioannis Akrotirianakis & Amit Chakraborty

Authors

Yangyang Xu
View author publications
You can also search for this author in PubMed Google Scholar
Ioannis Akrotirianakis
View author publications
You can also search for this author in PubMed Google Scholar
Amit Chakraborty
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ioannis Akrotirianakis.

Appendices

Appendix A: Proof of Proposition 1

First, we derive the convexity of each $f_i$ from the composition of the convex function $\phi _H$ and the linear transformation $y_i(b+{\mathbf{x}}_i^\top {\mathbf{w}})$ (e.g., see [4]). Thus, $f$ is also convex since it is the sum of $n$ convex functions. Second, it is easy to verify that

$$\begin{aligned}\phi '_H(t)=\left\{ \begin{array}{ll} 0,&\quad {\text{for}}\;\,t>1;\\ \frac{t-1}{\delta },&\quad {\text{for}}\;\,1-\delta <t\le 1;\\ -1,&\quad {\text {for}}\;\,t\le 1-\delta ; \end{array} \right. \end{aligned}$$

and it is Lipschitz continuous, namely $|\phi '_H(t)-\phi '_H(s)|\le \frac{1}{\delta }|t-s|,$ for any $t,s$.

Using the notations in (7) and by the chain rule, we have $\nabla f_i({\mathbf{u}})=\phi '_H({\mathbf{v}}_i^\top {\mathbf{u}}){\mathbf{v}}_i$, for $i=1,\ldots ,n$ and

$$\begin{aligned} \nabla f({\mathbf{u}})=\frac{1}{n}\sum _{i=1}^n\phi '_H({\mathbf{v}}_i^\top {\mathbf{u}}){\mathbf{v}}_i. \end{aligned}$$

(23)

Hence, for any ${\mathbf{u}},\hat{\mathbf{u}}$, we have

$$\begin{aligned}&\Vert \nabla f({\mathbf{u}})-\nabla f(\hat{\mathbf{u}})\Vert \\&\quad\le \frac{1}{n}\sum _{i=1}^n\left\| \nabla f_i({\mathbf{u}})-\nabla f_i(\hat{\mathbf{u}})\right\| \\&\quad=\frac{1}{n}\sum _{i=1}^n\left\| \left( \phi '({\mathbf{v}}_i^\top {\mathbf{u}})-\phi '({\mathbf{v}}_i^\top \hat{\mathbf{u}})\right) {\mathbf{v}}_i\right\| \\&\quad\le \frac{1}{n\delta }\sum _{i=1}^n|{\mathbf{v}}_i^\top ({\mathbf{u}}-\hat{\mathbf{u}})|\Vert {\mathbf{v}}_i\Vert \\&\quad\le \frac{1}{n\delta }\sum _{i=1}^n \Vert {\mathbf{v}}_i\Vert ^2\Vert {\mathbf{u}}-\hat{\mathbf{u}}\Vert , \end{aligned}$$

which completes the proof.

Appendix B: Proof of Theorem 1

We begin our analysis with the following lemma, which can be shown through essentially the same proof of Lemma 2.3 in [1].

Lemma 2

Let $\{\mathbf{u}^k\}$ be the sequence generated by (5) with $L_k\le L$ such that for all $k\ge 1$,

$$\begin{aligned} \xi _1({\mathbf{u}}^{k})&\le \xi _1(\hat{\mathbf{u}}^{k-1}) +\left\langle \nabla \xi _1(\hat{\mathbf{u}}^{k-1}), {\mathbf{u}}^{k}-\hat{\mathbf{u}}^{k-1}\right\rangle \nonumber \\&\quad +\frac{L_k}{2}\left\| {\mathbf{u}}^{k}-\hat{\mathbf{u}}^{k-1}\right\| ^2. \end{aligned}$$

(24)

Then for all $k\ge 1$, it holds for any ${\mathbf{u}}\in \mathcal {U}$ that

$$\begin{aligned} \xi ({\mathbf{u}})-\xi ({\mathbf{u}}^k)&\ge \frac{L_k}{2}\Vert {\mathbf{u}}^k-\hat{\mathbf{u}}^{k-1}\Vert ^2+\frac{\mu }{2}\Vert {\mathbf{u}}-{\mathbf{u}}^k\Vert ^2\nonumber \\&\quad +L_k\langle \hat{\mathbf{u}}^{k-1}-{\mathbf{u}}, {\mathbf{u}}^k-\hat{\mathbf{u}}^{k-1}\rangle . \end{aligned}$$

(25)

We also need the following lemma, which is the KL property for strongly convex functions:

Lemma 3

Suppose $\xi$ is strongly convex with constant $\mu >0$. Then for any ${\mathbf{u}},{\mathbf{v}}\in \mathrm {dom}(\partial \xi )$ and ${\mathbf{g}}_{\mathbf{u}}\in \partial \xi ({\mathbf{u}})$, we have

$$\begin{aligned} \xi ({\mathbf{u}})-\xi ({\mathbf{v}})\le \frac{1}{\mu }\Vert {\mathbf{g}}_{\mathbf{u}}\Vert ^2. \end{aligned}$$

(26)

Proof

For any ${\mathbf{g}}_{\mathbf{u}}\in \partial \xi ({\mathbf{u}})$, we have from the strong convexity of $\xi$ and the Cauchy-Schwarz inequality that

$$\begin{aligned} \xi ({\mathbf{u}})-\xi ({\mathbf{v}})\le \langle {\mathbf{g}}_{\mathbf{u}},{\mathbf{u}}-{\mathbf{v}}\rangle -\frac{\mu }{2}\Vert {\mathbf{u}}-{\mathbf{v}}\Vert ^2\le \frac{1}{\mu }\Vert {\mathbf{g}}_{\mathbf{u}}\Vert ^2, \end{aligned}$$

which completes the proof. $\square$

Global convergence

We first show ${\mathbf{u}}^k\rightarrow {\mathbf{u}}^*$. Letting ${\mathbf{u}}={\mathbf{u}}^{k-1}$ in (25) gives

$$\begin{aligned} &\xi ({\mathbf{u}}^{k-1})-\xi ({\mathbf{u}}^k)\\&\quad\ge \frac{L_k}{2}\Vert {\mathbf{u}}^k-\hat{\mathbf{u}}^{k-1}\Vert ^2+\frac{\mu }{2}\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^k\Vert ^2\nonumber \\&\quad\quad +L_k\langle \hat{\mathbf{u}}^{k-1}-{\mathbf{u}}^{k-1}, {\mathbf{u}}^k-\hat{\mathbf{u}}^{k-1}\rangle \nonumber \\&\quad=\frac{L_k}{2}\Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert ^2+\frac{\mu }{2}\Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert ^2\nonumber \\&\quad\quad -\frac{L_k}{2}\omega _{k-1}^2\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k-2}\Vert ^2\nonumber \\&\quad\ge \frac{L_k}{2}\Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert ^2+\frac{\mu }{2}\Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert ^2\\&\quad\quad -\frac{L_{k-1}}{2}\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k-2}\Vert ^2.\nonumber \end{aligned}$$

(27)

Summing up the inequality (27) over $k$, we have

$$\begin{aligned}&\xi ({\mathbf{u}}^0)-\xi ({\mathbf{u}}^K)\\&\quad\ge \frac{\mu }{2}\sum _{k=1}^K\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^k\Vert ^2 +\frac{L_K}{2}\Vert {\mathbf{u}}^K-{\mathbf{u}}^{K-1}\Vert ^2, \end{aligned}$$

which implies ${\mathbf{u}}^k-{\mathbf{u}}^{k-1}\rightarrow {\mathbf {0}}$ since $\xi ({\mathbf{u}}^k)$ is lower bounded.

Note that $\{{\mathbf{u}}^k\}$ is bounded since $\xi$ is coercive and $\{\xi ({\mathbf{u}}^k)\}$ is upper bounded by $\xi ({\mathbf{u}}^0)$. Hence, $\{{\mathbf{u}}^k\}$ has a limit point $\bar{\mathbf{u}}$, so there is a subsequence $\{{\mathbf{u}}^{k_j}\}$ converging to $\bar{\mathbf{u}}$. Note ${\mathbf{u}}^{k_j+1}$ also converges to $\bar{\mathbf{u}}$ since ${\mathbf{u}}^k-{\mathbf{u}}^{k-1}\rightarrow {\mathbf {0}}$. Without loss of generality, we assume $L_{k_j+1}\rightarrow \bar{L}$. Otherwise, we can take a convergent subsequence of $\{L_{k_j+1}\}$. By (5), we have

$$\begin{aligned} {\mathbf{u}}^{k_j+1}=&\arg \underset{\mathbf{u}}{\min } \xi _1(\hat{\mathbf{u}}^{k_j})+ \langle \nabla \xi _1(\hat{\mathbf{u}}^{k_j}),{\mathbf{u}}-\hat{\mathbf{u}}^{k_j}\rangle \\&+\frac{L_{k_j+1}}{2}\Vert {\mathbf{u}}-\hat{\mathbf{u}}^{k_j}\Vert ^2+\xi _2({\mathbf{u}}). \end{aligned}$$

Letting $j\rightarrow \infty$ in the above formula and observing $\hat{\mathbf{u}}^{k_j}\rightarrow \bar{\mathbf{u}}$ yield

$$\begin{aligned} \bar{\mathbf{u}}=\arg \underset{\mathbf{u}}{\min }&\xi _1(\bar{\mathbf{u}})+ \langle \nabla \xi _1(\bar{\mathbf{u}}),\mathbf{u}-\bar{\mathbf{u}}\rangle +\frac{\bar{L}}{2}\Vert {\mathbf{u}}-\bar{\mathbf{u}}\Vert ^2+\xi _2(\mathbf{u}), \end{aligned}$$

which indicates $-\nabla \xi _1(\bar{\mathbf{u}})\in \partial \xi _2(\bar{\mathbf{u}})$. Thus $\bar{\mathbf{u}}$ is the minimizer of $\xi$.

Since $\{\xi ({\mathbf{u}}^k)\}$ is nonincreasing and lower bounded, it converges to $\xi (\bar{\mathbf{u}})=\xi (\mathbf{u}^*)$. Noting $\frac{\mu }{2}\Vert {\mathbf{u}}^k-{\mathbf{u}}^*\Vert ^2\le \xi ({\mathbf{u}}^k)-\xi ({\mathbf{u}}^*),$ we have ${\mathbf{u}}^k\rightarrow {\mathbf{u}}^*.$

Convergence rate

Next we go to show (19). Without loss of generality, we assume $\xi ({\mathbf{u}}^*)=0$. Otherwise, we can consider $\xi -\xi ({\mathbf{u}}^*)$ instead of $\xi$. In addition, we assume $\xi ({\mathbf{u}}^k)>\xi ({\mathbf{u}}^*)$ for all $k\ge 0$ because if $\xi ({\mathbf{u}}^{k_0})=\xi ({\mathbf{u}}^*)$ for some $k_0$, then ${\mathbf{u}}^k={\mathbf{u}}^{k_0}={\mathbf{u}}^*$ for all $k\ge k_0$.

For ease of description, we denote $\xi _k=\xi ({\mathbf{u}}^k)$. Letting ${\mathbf{v}}={\mathbf{u}}^*$ in (26), we have

$$\begin{aligned} \sqrt{\xi _k}\le \frac{1}{\sqrt{\mu }}\Vert {\mathbf{g}}_{{\mathbf{u}}^k}\Vert , \quad {\text {for\; all}}\quad{\mathbf{g}}_{{\mathbf{u}}^k}\in \partial \xi ({\mathbf{u}}^k). \end{aligned}$$

(28)

Noting $-\nabla \xi _1(\hat{\mathbf{u}}^{k-1})-L_k({\mathbf{u}}^k-\hat{\mathbf{u}}^{k-1})+\nabla \xi _1({\mathbf{u}}^k)\in \partial \xi ({\mathbf{u}}^k)$ and $L_k\le L$, we have for all $k\ge K$,

$$\begin{aligned} \sqrt{\xi _k}\le \frac{2L}{\sqrt{\mu }}\left( \Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert +\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k-2}\Vert \right) \end{aligned}$$

(29)

Noting $\sqrt{\xi _k}-\sqrt{\xi _{k+1}}\ge \frac{\xi _k-\xi _{k+1}}{2\sqrt{\xi _k}}$ and using (27) yield

$$\begin{aligned}&(L_{k+1}+\mu )\Vert {\mathbf{u}}^{k}-{\mathbf{u}}^{k+1}\Vert ^2\\ &\quad\le L_k\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k}\Vert ^2+\frac{8L}{\sqrt{\mu }}\left( \sqrt{\xi _k}-\sqrt{\xi _{k+1}}\right) \Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert \\&\qquad+\frac{8L}{\sqrt{\mu }}\left( \sqrt{\xi _k}-\sqrt{\xi _{k+1}}\right) \Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k-2}\Vert , \end{aligned}$$

after rearrangements. Take $0<\delta <\frac{1}{2}(\sqrt{L+\mu }-\sqrt{L})$. Using the inequalities $a^2+b^2\le (a+b)^2$ and $\sqrt{ab}\le ta+\frac{1}{4t}b$ for $a,b,t>0$, we have from the above inequality that

$$\begin{aligned}&\sqrt{L_{k+1}+\mu }\Vert {\mathbf{u}}^{k}-{\mathbf{u}}^{k+1}\Vert \\ &\quad\le \sqrt{L_k}\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k}\Vert +\frac{2L}{\delta \sqrt{\mu }}\left( \sqrt{\xi _k}-\sqrt{\xi _{k+1}}\right) \\&\qquad+\delta \left( \Vert {\mathbf{u}}^k-{\mathbf{u}}^{k-1}\Vert +\Vert {\mathbf{u}}^{k-1}-{\mathbf{u}}^{k-2}\Vert \right) . \end{aligned}$$

Summing up the above inequality over $k$ and noting $\Vert {\mathbf{u}}^k-{\mathbf{u}}^{k+1}\Vert \rightarrow 0, \xi _k\rightarrow 0$ yield

$$\begin{aligned}&\sum _{k=m}^\infty \left( \sqrt{L_{k+1}+\mu }-\sqrt{L_{k+1}}-2\delta \right) \Vert {\mathbf{u}}^k-{\mathbf{u}}^{k+1}\Vert \\ \le&\left( \sqrt{L}+2\delta \right) \Vert {\mathbf{u}}^{m-1}-{\mathbf{u}}^m\Vert +\delta \Vert {\mathbf{u}}^{m-2}-{\mathbf{u}}^{m-1}\Vert +\frac{2L}{\delta \sqrt{\mu }}\sqrt{\xi _m}, \end{aligned}$$

which together with $\sqrt{L+\mu }-\sqrt{L}\le \sqrt{L_k+\mu }-\sqrt{L_k}$ for all $k\ge 0$ implies

$$\begin{aligned}&\sum _{k=m}^\infty \Vert {\mathbf{u}}^k-{\mathbf{u}}^{k+1}\Vert \nonumber \\ \le&C_1 \sqrt{\xi _m}+C_2\left( \Vert {\mathbf{u}}^{m-1}-{\mathbf{u}}^m\Vert +\Vert {\mathbf{u}}^{m-2}-{\mathbf{u}}^{m-1}\Vert \right) , \end{aligned}$$

(30)

where

$$\begin{aligned} C_1=\frac{2L}{\delta \sqrt{\mu }\left( \sqrt{L+\mu }-\sqrt{L}-2\delta \right) },\quad C_2=\frac{\sqrt{L}+2\delta }{\sqrt{L+\mu }-\sqrt{L}-2\delta }. \end{aligned}$$

Denote $S_m=\sum _{k=m}^\infty \Vert {\mathbf{u}}^k-{\mathbf{u}}^{k+1}\Vert$ and write (30) as

$$\begin{aligned} S_m\le C_1 \sqrt{\xi _m}+C_2(S_{m-2}-S_m), \end{aligned}$$

which together with (29) gives

$$\begin{aligned} S_m & \le \left( C_1\frac{2L}{\sqrt{\mu }}+C_2\right) (S_{m-2}-S_m)\\ &= C_3(S_{m-2}-S_m),\text { for\; all }\, m\ge 2. \end{aligned}$$

Let $\tau =\sqrt{\frac{C_3}{1+C_3}}$ and $C=S_0$. Then we have $S_m\le C\tau ^m, \forall m\ge 0.$ Note $\Vert {\mathbf{u}}^m-{\mathbf{u}}^*\Vert \le S_m$, and thus we complete the proof.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Xu, Y., Akrotirianakis, I. & Chakraborty, A. Proximal gradient method for huberized support vector machine. Pattern Anal Applic 19, 989–1005 (2016). https://doi.org/10.1007/s10044-015-0485-z

Download citation

Received: 16 June 2013
Accepted: 07 May 2015
Published: 31 May 2015
Issue Date: November 2016
DOI: https://doi.org/10.1007/s10044-015-0485-z

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

€32.70 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Bulgaria)

Instant access to the full article PDF.

Institutional subscriptions

Proximal gradient method for huberized support vector machine

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Alternating Direction Method of Multipliers for Regularized Multiclass Support Vector Machines

Sparse Proximal Support Vector Machine with a Specialized Interior-Point Method

A proximal quadratic surface support vector machine for semi-supervised binary classification

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Proof of Proposition 1

Appendix B: Proof of Theorem 1

Lemma 2

Lemma 3

Proof

Global convergence

Convergence rate

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Proximal gradient method for huberized support vector machine

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Alternating Direction Method of Multipliers for Regularized Multiclass Support Vector Machines

Sparse Proximal Support Vector Machine with a Specialized Interior-Point Method

A proximal quadratic surface support vector machine for semi-supervised binary classification

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Proof of Proposition 1

Appendix B: Proof of Theorem 1

Lemma 2

Lemma 3

Proof

Global convergence

Convergence rate

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation