
WDC probabilistic database benchmark for Dubio database


utwente-dmb/wdc_pdb


This README is copied from the old repository for version 1 of the WDC dataset. An updated README is expected soon.

VERSION 1

This is a Git repository from the University of Twente GitLab server. To clone this repository, run:

git clone git@git.snt.utwente.nl:flokstra/wdc-data-converter.git

Installation

Use the package manager pip to install the Postgres connection package psycopg2:

pip3 install psycopg2

To connect to the database, copy database.ini.tmpl to database.ini and fill in your connection parameters. The *.ini files are excluded from the git repository because they contain credentials.
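As an illustration of how such an INI file can be turned into a connection, here is a minimal sketch. The section name `postgresql`, the helper names `load_db_params` and `connect`, and the key names in the usage note are assumptions, not taken from wdc2pg.py; check database.ini.tmpl for the names the project actually uses.

```python
from configparser import ConfigParser


def load_db_params(path="database.ini", section="postgresql"):
    """Read connection parameters from an INI file into a dict.

    The section name "postgresql" is an assumption; use whatever
    section database.ini.tmpl actually defines.
    """
    parser = ConfigParser()
    # ConfigParser.read() returns the list of files it could read.
    if not parser.read(path):
        raise FileNotFoundError(f"cannot read {path}")
    if not parser.has_section(section):
        raise KeyError(f"section [{section}] not found in {path}")
    return dict(parser.items(section))


def connect(path="database.ini", section="postgresql"):
    """Open a psycopg2 connection using the parameters from the INI file."""
    import psycopg2  # imported here so load_db_params works without it

    return psycopg2.connect(**load_db_params(path, section))
```

With a section containing keys such as `host`, `dbname`, `user` and `password`, calling `connect()` passes them to `psycopg2.connect` as keyword arguments.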

Data

The data used by this project can be found on the site:

http://webdatacommons.org/largescaleproductcorpus/

A small 11-line sample file (compress it to .gz before use) is:

http://data.dws.informatik.uni-mannheim.de/largescaleproductcorpus/samples/sample_offersenglish.json
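The sample can be compressed with the `gzip` command-line tool or from Python. A minimal sketch using only the standard library; the helper name `gzip_file` is hypothetical and not part of this project:

```python
import gzip
import shutil


def gzip_file(src, dst=None):
    """Compress src into dst (default: src + '.gz') and return dst."""
    dst = dst or src + ".gz"
    # Stream the file in chunks so large inputs are not loaded into memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

For example, `gzip_file("sample_offersenglish.json")` writes `sample_offersenglish.json.gz` next to the original.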

The large English corpus (16M+ offers) used for testing is:

http://data.dws.informatik.uni-mannheim.de/largescaleproductcorpus/data/offers_english.json.gz

The .gitignore file for this project prevents .gz files from being added to the repository.
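Because the corpus is distributed as a .gz file, it can be read without unpacking it first. The following is a minimal sketch, assuming the file holds one JSON object per line (which the 11-line sample suggests); the function name `iter_offers` is hypothetical and this is not necessarily how wdc2pg.py reads its input.

```python
import gzip
import json


def iter_offers(path):
    """Yield one decoded JSON object per non-empty line of a .gz file."""
    # "rt" opens the compressed stream in text mode, decoding on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Reading lazily like this keeps memory use flat even for the full 16M+ offer file.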

Run

Once the database connection is configured and the data is downloaded, you can run the wdc2pg.py Python program:

python3 wdc2pg.py

To change the input JSON file or the base name ({TABLEBASE}) of the generated Postgres tables, change the parameters of the convert_json() function call at the bottom of the file.

The program creates two tables in the database. The first, {TABLEBASE}_key, contains the (url, node_id) pairs that identify an offer, together with an auto-incrementing 'key'. The second, {TABLEBASE}_offer, contains this offer 'key' together with the cluster_id and the 15 most used offer properties, as listed in Figure 1 of the WDC document. The columns for these properties are generated automatically from the TOP15PROPERTIES list in the wdc2pg.py file. When properties are added to or removed from that list, the table columns and the conversion follow automatically.
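To make the two-table layout concrete, here is a sketch of how table definitions could be generated from a property list. It is not the code in wdc2pg.py: the function `build_ddl`, the example property names, and all column types and constraints are assumptions chosen for illustration.

```python
# Hypothetical subset for illustration; the real list is TOP15PROPERTIES in wdc2pg.py.
EXAMPLE_PROPERTIES = ["name", "description", "brand"]


def build_ddl(tablebase, properties):
    """Return (key_ddl, offer_ddl) CREATE TABLE statements as strings."""
    key_table = f"{tablebase}_key"
    offer_table = f"{tablebase}_offer"
    # Column types and constraints below are assumptions, not the project's schema.
    key_ddl = (
        f"CREATE TABLE {key_table} (\n"
        "    key SERIAL PRIMARY KEY,\n"
        "    url TEXT NOT NULL,\n"
        "    node_id TEXT NOT NULL,\n"
        "    UNIQUE (url, node_id)\n"
        ");"
    )
    columns = [
        f"    key INTEGER REFERENCES {key_table} (key)",
        "    cluster_id BIGINT",
    ]
    # One quoted TEXT column per property, so editing the list changes the table.
    columns += [f'    "{prop}" TEXT' for prop in properties]
    offer_ddl = f"CREATE TABLE {offer_table} (\n" + ",\n".join(columns) + "\n);"
    return key_ddl, offer_ddl
```

Calling `build_ddl("wdc", EXAMPLE_PROPERTIES)` yields statements for `wdc_key` and `wdc_offer`; adding a name to the list adds one column to the offer table.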

The program is not very fast. Converting the complete set of 16M English offers takes approximately 5 hours on my MacBook.
