Fast and scalable vector similarity joins with MapReduce

Yang, Byoungju; Kim, Hyun Joon; Shim, Junho; Lee, Dongjoo; Lee, Sang-goo

doi:10.1007/s10844-015-0363-6

Fast and scalable vector similarity joins with MapReduce

Published: 22 May 2015

Volume 46, pages 473–497, (2016)
Cite this article

Journal of Intelligent Information Systems Aims and scope Submit manuscript

Byoungju Yang¹,
Hyun Joon Kim²,
Junho Shim³,
Dongjoo Lee⁴ &
…
Sang-goo Lee²

479 Accesses
3 Citations
Explore all metrics

Abstract

Vector similarity join, which finds similar pairs of vector objects, is a computationally expensive process. As its number of vectors increases, the time needed for join operation increases proportional to the square of the number of vectors. Various filtering techniques have been proposed to reduce its computational load. On the other hand, MapReduce algorithms have been studied to manage large datasets. The recent improvements, however, still suffer from its computational time and scalability. In this paper, we propose a MapReduce algorithm FACET(FAst and sCalable maprEduce similariTy join) to efficiently solve the vector similarity join problem on large datasets. FACET is an all-pair exact join algorithm, composed of two stages. In the first stage, we use our own novel filtering techniques to eliminate dissimilar pairs to generate non-redundant candidate pairs. The second stage matches candidate pairs with the vector data so that similar pairs are produced as the output. Both stages employ parallelism offered by MapReduce. The algorithm is currently designed for cosine similarity and Self Join case. Extensions to other similarity measures and R-S Join case are also discussed. We provide the I/O analysis of the algorithm. We evaluate the performance of the algorithm on multiple real world datasets. The experiment results show that our algorithm performs, on average, 40 % upto 800 % better than the previous state-of-the-art MapReduce algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

€32.70 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Bulgaria)

Instant access to the full article PDF.

Institutional subscriptions

Multidimensional Similarity Join Using MapReduce

An efficient MapReduce algorithm for similarity join in metric spaces

Article 06 February 2016

Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Notes

The algorithm is referred to as VCL after the names of the authors, Vernica, Carey and Li
∥x∥ is a L2−N o r m of the vector x.
Identity-Mapper does nothing and just partition the input data into reducer by K e y.

References

Baraglia, R., De Francisci Morales, G., & Lucchese, C. (2010). Document similarity self-join with mapreduce. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pp. 731–736. doi:10.1109/ICDM.2010.70.
Bayardo, R.J., Ma, Y., & Srikant, R. (2007). Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pp. 131–140. doi:10.1145/1242572.1242591. New York: ACM.
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., & Tian, Y. (2010). A comparison of join algorithms for log processing in mapreduce. In Proceedings of the 2010 ACM SIGMOD international conference on Management of data, SIGMOD ’10, pp. 975–986.
Böhm, C., Kriegel, H.P., & Seidl, T. (2002). Combining approximation techniques and vector quantization for adaptable similarity search. Journal of Intelligent Information System, 19(2), 207–230.
Article Google Scholar
Chakrabarti, K., Chaudhuri, S., Ganti, V., & Xin, D. (2008). An efficient filter for approximate membership checking. In SIGMOD ’08, pp. 805–818.
Chaudhuri, S., Ganti, V., & Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, pp. 5. doi:10.1109/ICDE.2006.9.
Conrad, J.G., Guo, X.S., & Schriber, C.P. (2003). Online duplicate document detection: signature reliability in a dynamic retrieval environment. In Proceedings of the 12th international conference on Information and knowledge management, CIKM ’03, pp. 443–452. doi:10.1145/956863.956946.
Dean, J., & Ghemawat, S. (2008). Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113. doi:10.1145/1327452.1327492.
Article Google Scholar
Dooms, S., Audenaert, P., Fostier, J., Pessemier, T.D., & Martens, L. (2013). In-memory, distributed content-based recommender system. Journal of Intelligent Information System. doi:10.1007/s10844-013-0276-1.
GroupLens (2011). Movielens data sets, grouplens research. http://www.grouplens.org/node/73.
Henzinger, M. (2006). Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pp. 284–291. doi:10.1145/1148170.1148222.
Koren, Y. (2010). Collaborative filtering with temporal dynamics. Communications of the ACM, 4, 89–97. doi:10.1145/1721654.1721677.
Article Google Scholar
Lee, D. (2011). An efficient filtering framework for vector similarity joins. Ph.D. thesis, Seoul National University, Seoul, South Korea.
Lee, D., Park, J., Shim, J., & Lee, S.g. (2010). An efficient similarity join algorithm with cosine similarity predicate. In Proceedings of the 21st International Conference on Database and Expert Systems Applications: Part II, DEXA’10, pp. 422–436. Berlin: Springer. http://dl.acm.org/citation.cfm?id=1887568.1887611.
Leskovec, J. (2012). Stanford network analysis project. http://snap.stanford.edu/.
Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
Article Google Scholar
Metwally, A., & Faloutsos, C. (2012). V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proceedings of the VLDB Endow, 5(8), 704–715. http://dl.acm.org/citation.cfm?id=2212351.2212353.
Article Google Scholar
Nistér, D., & Stewénius, H. (2006). Scalable recognition with a vocabulary tree. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 2161–2168.
Ribeiro, L.A., & Harder, T. (2011). Generalizing prefix filtering to improve set similarity joins. Information Systems, 36(1), 62–78. doi:10.1016/j.is.2010.07.003. http://www.sciencedirect.com/science/article/B6V0G-50GMMMG-3/2/f121cda9a0ee1ff6738138a90ff2d488.
Article Google Scholar
Ruan, J., & Zhang, W. (2007). An efficient spectral algorithm for network community discovery and its applications to biological and social networks. In Seventh IEEE International Conference on Data Mining, ICDM 2007, pp. 643–648: IEEE.
Satuluri, V., & Parthasarathy, S. (2009). Scalable graph clustering using stochastic flows: applications to community discovery. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’09, pp. 737–746. doi:10.1145/1557019.1557101. New York: ACM.
Su, X., & Khoshgoftaar, T.M. (2009). A survey of collaborative filtering techniques. Advanced Artificial Intelligence, 2009, 4:2–4:2. doi:10.1155/2009/421425.
Article Google Scholar
Vernica, R., Carey, M.J., & Li, C. (2010). Efficient parallel set-similarity joins using mapreduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD ’10, pp. 495–506. doi:10.1145/1807167.1807222. New York: ACM.
Xiao, C., Wang, W., Lin, X., Yu, J.X., & Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems, 36(3), 15:1–15:41. doi:10.1145/2000824.2000825.
Article Google Scholar
Yang, B., Myung, J., Lee, S.g., & Lee, D. (2013). A mapreduce-based filtering algorithm for vector similarity join. In Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’13, pp. 71:1–71:5. doi:10.1145/2448556.2448627. New York: ACM.

Download references

Acknowledgements

The authors would like to thank Hanbit Lee and Jaeseok Myung for their help in performing the experiment and in discussing about the performance improvement.

Author information

Authors and Affiliations

Database Server Technology, Oracle Corporation, Seoul, 135-798, Korea
Byoungju Yang
School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
Hyun Joon Kim & Sang-goo Lee
Department of Computer Science, Sookmyung Women’s University, Seoul, 140-742, Korea
Junho Shim
Software R&D Center, Samsung Electronics Co., Suwon, 443-742, Korea
Dongjoo Lee

Authors

Byoungju Yang
View author publications
You can also search for this author in PubMed Google Scholar
Hyun Joon Kim
View author publications
You can also search for this author in PubMed Google Scholar
Junho Shim
View author publications
You can also search for this author in PubMed Google Scholar
Dongjoo Lee
View author publications
You can also search for this author in PubMed Google Scholar
Sang-goo Lee
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Junho Shim.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yang, B., Kim, H.J., Shim, J. et al. Fast and scalable vector similarity joins with MapReduce. J Intell Inf Syst 46, 473–497 (2016). https://doi.org/10.1007/s10844-015-0363-6

Download citation

Received: 17 January 2014
Revised: 27 April 2015
Accepted: 28 April 2015
Published: 22 May 2015
Issue Date: June 2016
DOI: https://doi.org/10.1007/s10844-015-0363-6

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

€32.70 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (Bulgaria)

Instant access to the full article PDF.

Institutional subscriptions

Fast and scalable vector similarity joins with MapReduce

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Multidimensional Similarity Join Using MapReduce

An efficient MapReduce algorithm for similarity join in metric spaces

Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Fast and scalable vector similarity joins with MapReduce

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Multidimensional Similarity Join Using MapReduce

An efficient MapReduce algorithm for similarity join in metric spaces

Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data

Explore related subjects

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation