- invited-talk, August 2018
Can Deep Learning Compensate for a Shallow Evaluation?
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 5, Page 1, https://doi.org/10.1145/3209280.3236023
The last ten years have witnessed an enormous increase in the application of "deep learning" methods to both spoken and textual natural language processing. Have they helped? With respect to some well-defined tasks such as language modelling and ...
- abstract, August 2018
Document Changes: Modeling, Detection, Storage and Visualization (DChanges 2018)
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 3, Pages 1–2, https://doi.org/10.1145/3209280.3232792
The DChanges series of workshops focuses on changes in all their aspects and applications: algorithms to detect changes, models to describe them and techniques to present them to the users are only some of the topics that are investigated. This year, we ...
- tutorial, August 2018
Automatic Text Summarization and Classification
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 1, Pages 1–2, https://doi.org/10.1145/3209280.3232791
In this tutorial, we consider important aspects (algorithms, approaches, considerations) for tagging both unstructured and structured text for downstream use. This includes summarization, in which text information is compressed for more efficient ...
- invited-talk, August 2018
The Quest for Total Recall
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 6, Pages 1–2, https://doi.org/10.1145/3209280.3232788
The objective of high-recall information retrieval (HRIR) is to identify substantially all information relevant to an information need, where the consequences of missing or untimely results may have serious legal, policy, health, social, safety, defence,...
- short-paper, August 2018
ARCHANGEL: Trusted Archives of Digital Public Documents
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 31, Pages 1–4, https://doi.org/10.1145/3209280.3229120
We present ARCHANGEL; a decentralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional ...
- short-paper, August 2018
Main Content Detection in HTML Journal Articles
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 36, Pages 1–4, https://doi.org/10.1145/3209280.3229115
Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a ...
- short-paper, August 2018
Text Mining and Recommender Systems for Predictive Policing
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 15, Pages 1–4, https://doi.org/10.1145/3209280.3229112
We present some results from a joint project between HP Labs, Cardiff University and Dyfed Powys Police on predictive policing. Applications of the various techniques from recommender systems and text mining to the problem of crime patterns recognition ...
- short-paper, August 2018
Query Expansion in Enterprise Search
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 33, Pages 1–4, https://doi.org/10.1145/3209280.3229111
Although web search remains an active research area, interest in enterprise search has not kept up with the information requirements of the contemporary workforce. To address these issues, this research aims to develop, implement, and study the query ...
- short-paper, August 2018
The Causal Graph CRDT for Complex Document Structure
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 34, Pages 1–4, https://doi.org/10.1145/3209280.3229110
Commutative Replicated Data Types (CRDTs) are an emerging tool for real-time collaborative editing. Existing work on CRDTs mostly focuses on documents as a list of text content, but large documents (having over 7,000 pages) with complex sectional ...
- short-paper, August 2018
Document clustering as a record linkage problem
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 39, Pages 1–4, https://doi.org/10.1145/3209280.3229109
This work examines document clustering as a record linkage problem, focusing on named-entities and frequent terms, using several vector and graph-based document representation methods and k-means clustering with different similarity measures. The JedAI ...
- short-paper, August 2018
SlideDiff: Animating Textual and Media Changes in Slides
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 37, Pages 1–4, https://doi.org/10.1145/3209280.3229107
SlideDiff is a system that automatically creates an animated rendering of textual and media differences between two versions of a slide presentation. While previous work focused on either textual or image data, SlideDiff integrates both text and media ...
- short-paper, August 2018
Measuring the Centrality of the References in Scientific Papers
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 44, Pages 1–4, https://doi.org/10.1145/3209280.3229104
Citation analysis is considered as major and one of the most popular branches of bibliometrics. Citation analysis is based on the assumption that all citations have similar values and weights each equally. Specific research fields like content-based ...
- short-paper, August 2018
Helmholtz Principle on word embeddings for automatic document segmentation
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 40, Pages 1–4, https://doi.org/10.1145/3209280.3229103
Automatic document segmentation gets more and more attention in the natural language processing field. The problem is defined as text division into lexically coherent fragments. In fact, most of realistic documents are not homogeneous, so extracting ...
- short-paper, August 2018
Annotation Data Management with JeDIS
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 42, Pages 1–4, https://doi.org/10.1145/3209280.3229102
This paper introduces the Jena Document Information System (JeDIS). The focus lies on its capability to partition annotation graphs into modules. Annotation modules are defined in terms of types from the annotation schema. Modules allow easy ...
- short-paper, August 2018
Automatic Term Extraction in Technical Domain using Part-of-Speech and Common-Word Features
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 51, Pages 1–4, https://doi.org/10.1145/3209280.3229100
Extracting key terms from technical documents allows us to write effective documentation that is specific and clear, with minimum ambiguity and confusion caused by nearly synonymous but different terms. For instance, in order to avoid confusion, the ...
- short-paper, August 2018
GOWDA: Goal-oriented Web Documents Querying tool
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 47, Pages 1–4, https://doi.org/10.1145/3209280.3229099
Each day, a vast amount of data is published on the web. In addition, the rate at which content is being published is growing, which has the potential to overwhelm users, particularly those who are technically unskilled. Furthermore, users from various ...
- short-paper, August 2018
Semantically Weighted Similarity Analysis for XML-based Content Components
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 20, Pages 1–4, https://doi.org/10.1145/3209280.3229098
Uncontrolled variants and duplicate content are ongoing problems in component content management; they decrease the overall reuse of content components. Similarity analyses can help to clean up existing databases and identify problematic texts, however, ...
- short-paper, August 2018
diffi: diff improved; a preview
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 38, Pages 1–4, https://doi.org/10.1145/3209280.3229084
diffi (diff improved) is a comparison tool whose primary goal is to describe the differences between the content of two documents regardless of their formats. diffi examines the stacks of abstraction levels of the two documents to be compared, finds ...
- research-article, August 2018
iDocChip: A Configurable Hardware Architecture for Historical Document Image Processing: Percentile Based Binarization
Vladimir Rybalkin, Syed Saqib Bukhari, Muhammad Mohsin Ghaffar, Aqib Ghafoor, Norbert Wehn, Andreas Dengel
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 24, Pages 1–8, https://doi.org/10.1145/3209280.3209538
End-to-end Optical Character Recognition (OCR) systems are heavily used to convert document images into machine-readable text. Commercial and open-source OCR systems (like Abbyy, OCRopus, Tesseract etc.) have traditionally been optimized for contemporary ...
- research-article, August 2018
Exploiting patterns and templates for technical documentation
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018, Article No.: 30, Pages 1–9, https://doi.org/10.1145/3209280.3209537
There are several domains in which the documents are made of reusable pieces. Template languages have been widely studied by the document engineering community to deal with common structures and textual fragments. Though, templating mechanisms are often ...