Finding and Extracting Data Records from Web Pages

Álvarez, Manuel; Pan, Alberto; Raposo, Juan; Bellas, Fernando; Cacheda, Fidel

doi:10.1007/978-3-540-77092-3_41

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4808)

Included in the following conference series:

International Conference on Embedded and Ubiquitous Computing

Abstract

Many HTML pages are generated by software that queries an underlying database and fills in a template with the results. In the process, the metainformation about the data structure is lost, so automated programs cannot process these data as powerfully as they can process information held in a database. We propose a set of novel techniques for detecting structured records in a web page and extracting the data values that constitute them. Our method requires only one input page. It starts by identifying the data region of interest in the page. This region is then partitioned into records using a clustering method that groups similar subtrees in the DOM tree of the page. Finally, the attributes of the data records are extracted using a method based on multiple string alignment. We have tested our techniques on a large number of real web sources, obtaining high precision and recall values.
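To make the record-partitioning step more concrete, the following is a minimal sketch of grouping similar DOM subtrees. It is not the authors' algorithm: the abstract does not specify the similarity measure or the clustering procedure, so the sketch assumes each subtree is flattened into a sequence of tag names, compares sequences with a normalized edit distance, and clusters greedily against a threshold. The function names, the `threshold` value of 0.7, and the sample tag sequences are all hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means identical tag sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_subtrees(subtrees: List[Sequence[str]],
                     threshold: float = 0.7) -> List[List[int]]:
    """Greedy clustering (an assumption, not the paper's method):
    each subtree joins the first cluster whose first member is
    at least `threshold` similar, otherwise it starts a new cluster."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(subtrees):
        for cluster in clusters:
            if similarity(subtrees[cluster[0]], seq) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical sibling subtrees, each flattened to its tag names.
subtrees = [
    ["tr", "td", "a", "td", "span"],
    ["tr", "td", "a", "td", "span", "img"],
    ["hr"],
    ["tr", "td", "a", "td", "span"],
]
print(cluster_subtrees(subtrees))
```

In this sketch the three table rows fall into one cluster and the separator into another, which is the kind of grouping a record-detection step would look for. A fuller treatment would compare tree structure rather than flat tag sequences and would then align the text fields of the grouped records to recover attributes.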

This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.

Author information

Authors and Affiliations

Department of Information and Communications Technologies, University of A Coruña, Campus de Elviña s/n. 15071. A Coruña, Spain
Manuel Álvarez, Alberto Pan, Juan Raposo, Fernando Bellas & Fidel Cacheda

Editor information

Tei-Wei Kuo, Edwin Sha, Minyi Guo, Laurence T. Yang, Zili Shao

About this paper

Cite this paper

Álvarez, M., Pan, A., Raposo, J., Bellas, F., Cacheda, F. (2007). Finding and Extracting Data Records from Web Pages. In: Kuo, TW., Sha, E., Guo, M., Yang, L.T., Shao, Z. (eds) Embedded and Ubiquitous Computing. EUC 2007. Lecture Notes in Computer Science, vol 4808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77092-3_41

DOI: https://doi.org/10.1007/978-3-540-77092-3_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77091-6
Online ISBN: 978-3-540-77092-3
eBook Packages: Computer Science, Computer Science (R0)
