poster

Reproducible and Reliable Distributed Classification of Text Streams

Authors:

Boris Novikov

DEBS '19: Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems

Pages 264 - 265

https://doi.org/10.1145/3328905.3332514

Published: 24 June 2019

Abstract

Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and have proved effective for machine learning, but they do not provide low latency. State-of-the-art distributed stream processing systems, on the other hand, achieve low latency but do not support the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and fault tolerance mechanisms can affect the correctness of a text classification data flow. We also propose solutions that mitigate the pitfalls we identify.

Index Terms

Information systems

Published In

DEBS '19: Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems

June 2019

291 pages

ISBN: 9781450367943

DOI: 10.1145/3328905

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Poster

Qualifiers

Poster
Research
Refereed limited

Conference

DEBS '19

DEBS '19: The 13th ACM International Conference on Distributed and Event-based Systems

June 24 - 28, 2019

Darmstadt, Germany

Acceptance Rates

DEBS '19 Paper Acceptance Rate 13 of 47 submissions, 28%;

Overall Acceptance Rate 145 of 583 submissions, 25%
