DOI: 10.1145/3328905.3332514
poster

Reproducible and Reliable Distributed Classification of Text Streams

Published: 24 June 2019

Abstract

Large-scale classification of text streams is an essential problem that is hard to solve. Batch processing systems are scalable and have proved effective for machine learning, but they do not provide low latency. State-of-the-art distributed stream processing systems, on the other hand, achieve low latency but do not offer the same level of fault tolerance and determinism. In this work, we discuss how the distributed streaming computational model and its fault tolerance mechanisms can affect the correctness of a text classification data flow. We also propose solutions that mitigate the pitfalls we identify.


Published In

DEBS '19: Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems
June 2019
291 pages
ISBN: 9781450367943
DOI: 10.1145/3328905
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Poster

Author Tags

  1. Data streams
  2. exactly once
  3. reproducibility
  4. text classification

Qualifiers

  • Poster
  • Research
  • Refereed limited

Conference

DEBS '19

Acceptance Rates

DEBS '19 Paper Acceptance Rate: 13 of 47 submissions, 28%
Overall Acceptance Rate: 145 of 583 submissions, 25%
