This blog post shows how to build a Twitter bot in Python, using Google Spreadsheets as the data source.
The service presented here was originally created for a friend of mine who works at Megacorp Inc. They have a marketing intelligence department that fills out stalked information about potential customers. This information is stored in Google Spreadsheets. Every day a new spreadsheet arrives in a folder. My friend then goes through all of the leads in the spreadsheet, checks who has a Twitter account, and harasses them on Twitter about Megacorp Inc. products.
To make my friend jobless I decided to replace his tedious workflow with a Python script. Python is a programming language for making simple tasks simple, eliminating the feeling of repeating yourself in as few lines as possible. So it is a good weapon of choice for crushing middle class labor force participation.
The bot sends two tweets to every Twitter user. The timing between the tweets and the content of the second tweet are randomized, just to make sure that no one can figure out in the blink of an eye that they are actually communicating with a bot.
The ingredients of this Twitter bot are:
- Python 3.4+ – a snake programming language loved by everyone
- gspread – a Python client for Google Spreadsheets making reading and manipulating data less painful
- tweepy – a Twitter client library for Python
- ZODB – an ACID compliant transactional database for native Python objects
The script is pretty much self-contained: around 200 lines of Python code and three hours of work.
1. Authenticating with third-party services
The bot uses the OAuth protocol to authenticate itself against Google services (Google Drive, Google Spreadsheets) and Twitter. In OAuth, you arrive at the service provider's web site through your normal web browser. If you are not yet logged in, the service asks you to log in. Then you get a page asking you to authorize the app. Twitter authentication is done in a separate script, tweepyauth.py, which asks you to enter the PIN shown on the Twitter website. The Google API client does things differently and spins up a local web server on a localhost port. When you authorize the app on the Google side, you are redirected back to this local web server and the script grabs the authentication token from there.
The script stores the authentication tokens in JSON files. You can run the script on your local computer first to generate the JSON files and then move everything to a server, where a web browser for the authentication dance may not be available.
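On the Twitter side, reading the cached tokens back and constructing an authenticated client takes only a few lines. Here is a minimal sketch of that pattern, mirroring the get_tweepy() function in the full source below (get_twitter_client is just an illustrative name):

```python
import json

import tweepy


def get_twitter_client(token_file="twitter_oauth.json"):
    # Load the tokens tweepyauth.py stored on disk earlier
    with open(token_file, "rt") as f:
        creds = json.load(f)

    auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
    auth.set_access_token(creds["access_token"], creds["access_token_secret"])
    return tweepy.API(auth)
```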
2. Maintaining persistent state
The bot needs to maintain state. It needs to process a new spreadsheet every day, but on some days the bot might not be running, so it needs to remember which spreadsheets it has already processed. Sometimes the spreadsheets contain duplicate entries of the same Twitter handle, and we don't want to harass that Twitter user over and over again. Some data cleaning is also applied to the column contents, as a cell might contain a raw Twitter handle or an HTTP or HTTPS URL to a Twitter user – those marketing intelligence people are not very strict about what they spill into their spreadsheets.
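The cleaning itself boils down to a few string operations. Here is a condensed sketch of what extract_twitter_handles() in the full source does to a single cell value (the full version also handles comma-separated lists of URLs in one cell; normalize_twitter_handle is just an illustrative name):

```python
def normalize_twitter_handle(cell_content):
    """Turn a raw handle or a twitter.com URL into a bare handle, or return None."""
    cell_content = cell_content.strip()
    if not cell_content:
        return None

    for prefix in ("https://twitter.com/", "http://twitter.com/"):
        if cell_content.startswith(prefix):
            cell_content = cell_content[len(prefix):]
            # Clean old style fragment URLs e.g. #!/foobar
            if cell_content.startswith("#!/"):
                cell_content = cell_content[len("#!/"):]

    return cell_content or None
```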
The state is maintained using ZODB. ZODB is a transactional database and very robust. It is mature – probably older than some readers of this blog post – with multigigabyte deployments running factories around the world. It runs in-process like SQLite and doesn't need any other software on the machine. It doesn't need an ORM, as it stores native Python objects: to make your application state persistent you just stick your Python objects into the ZODB root. Everything inside a transaction context manager is written to disk, or nothing is.
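To show how little ceremony this takes, here is a minimal self-contained sketch of the ZODB pattern the bot uses (the file name state.fs is arbitrary):

```python
import ZODB
import ZODB.FileStorage
import transaction
from persistent.mapping import PersistentMapping

# In-process file storage - no database server needed
storage = ZODB.FileStorage.FileStorage("state.fs")
db = ZODB.DB(storage)
connection = db.open()
root = connection.root

# Everything in this block is committed atomically - all or nothing
with transaction.manager:
    if not hasattr(root, "twitter_handles"):
        root.twitter_handles = PersistentMapping()
    root.twitter_handles["exampleuser"] = {"first_tweet_at": None}
```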
As a side note, using Google Spreadsheets over the REST API is painfully slow. If you need to process larger amounts of data, it might be more efficient to download the data locally as a CSV export and work from there.
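As a hedged sketch of that shortcut (not something this bot does): Drive v2 file resources for Google Sheets carry an exportLinks mapping, so you could pull the sheet down as CSV over the already-authorized httplib2 handle and parse it locally. download_sheet_as_csv is a hypothetical helper, not part of the bot:

```python
import csv
import io


def download_sheet_as_csv(http, file_item):
    # file_item is a Drive v2 file resource, e.g. one returned by watch_files() below;
    # the text/csv export covers the first worksheet only
    url = file_item["exportLinks"]["text/csv"]
    response, content = http.request(url)
    return list(csv.reader(io.StringIO(content.decode("utf-8"))))
```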
3. Usage instructions
This code is exemplary. You can't run it as-is, because you do not have the correct data or access to it. Use it to inspire your imagination. If you were to use it, it would happen like this:
- Follow the official Python package installation guide to install the dependencies (the exact pip command is given after these steps).
- Register a Twitter app. Leave the callback URL empty.
- Register a Google services App.
- Run tweepyauth.py to get Twitter tokens stored in twitter_oauth.json.
- Run the bot once on your local computer and authenticate it against Google services, using the client_secrets.json you downloaded when registering the app; the resulting token is stored in oauth_authorization.json.
- See that the bot starts working.
- Move the whole thing to a server.
- Leave it running in a loop at a Bash prompt forever, e.g.: $ while true; do python chirper.py; sleep 3600; done
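The pip command, as given in the chirper.py docstring below:

```
pip install --upgrade oauth2client gspread google-api-python-client ZODB zodbpickle tweepy iso8601
```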
4. Source code
chirper.py
""" Installation: pip install --upgrade oauth2client gspread google-api-python-client ZODB zodbpickle tweepy iso8601 """ import time import datetime import json import httplib2 import os import sys # Authorize server-to-server interactions from Google Compute Engine. from apiclient import discovery import oauth2client from oauth2client import client from oauth2client import tools # ZODB import ZODB import ZODB.FileStorage import BTrees.OOBTree from persistent.mapping import PersistentMapping import random import transaction # Date parsing import iso8601 # https://github.com/burnash/gspread import gspread # Twitter client import tweepy try: import argparse flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args() except ImportError: flags = None # We need permissions to drive list files, drive read files, spreadsheet manipulation SCOPES = ['https://www.googleapis.com/auth/devstorage.read_write', 'https://www.googleapis.com/auth/drive.metadata.readonly', 'https://spreadsheets.google.com/feeds'] CLIENT_SECRET_FILE = 'client_secrets.json' APPLICATION_NAME = 'MEGACORP SPREADSHEET SCRAPER BOT' OAUTH_DATABASE = "oauth_authorization.json" FIRST_TWEET_CHOICES = [ "WE AT MEGACORP THINK YOU MIGHT LIKE US - http://megacorp.example.com", ] SECOND_TWEET_CHOICES = [ "AS WELL, WE ARE PROBABLY CHEAPER THAN COMPETITORCORP INC. http://megacorp.example.com/prices", "AS WELL, OUR FEATURE SET IS LONGER THAN MISSISSIPPI http://megacorp.example.com/features", "AS WELL, OUR CEO IS VERY HANDSOME http://megacorp.example.com/team", ] # Make sure our text is edited correctly for tweet in FIRST_TWEET_CHOICES + SECOND_TWEET_CHOICES: assert len(tweet) < 140 # How many tweets can be send in one run... limit for testing / debugging MAX_TWEET_COUNT = 10 # https://developers.google.com/drive/web/quickstart/python def get_google_credentials(): """Gets valid user credentials from storage. If nothing has been stored, or if the stored credentials are invalid, the OAuth2 flow is completed to obtain the new credentials. Returns: Credentials, the obtained credential. 
""" credential_path = os.path.join(os.getcwd(), OAUTH_DATABASE) store = oauth2client.file.Storage(credential_path) credentials = store.get() if not credentials or credentials.invalid: flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES) flow.user_agent = APPLICATION_NAME if flags: credentials = tools.run_flow(flow, store, flags) else: # Needed only for compatability with Python 2.6 credentials = tools.run(flow, store) print('Storing credentials to ' + credential_path) return credentials def get_tweepy(): """Create a Tweepy client instance.""" creds = json.load(open("twitter_oauth.json", "rt")) auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"]) auth.set_access_token(creds["access_token"], creds["access_token_secret"]) api = tweepy.API(auth) return api def get_database(): """Get or create a ZODB database where we store information about processed spreadsheets and sent tweets.""" storage = ZODB.FileStorage.FileStorage('chirper.data.fs') db = ZODB.DB(storage) connection = db.open() root = connection.root # Initialize root data structure if not present yet with transaction.manager: if not hasattr(root, "files"): root.files = BTrees.OOBTree.BTree() if not hasattr(root, "twitter_handles"): # Format of {added: datetime, imported: datetime, sheet: str, first_tweet_at: datetime, second_tweet_at: datetime} root.twitter_handles = BTrees.OOBTree.BTree() return root def extract_twitter_handles(spread, sheet_id, column_id="L"): """Process one spreadsheet and return Twitter handles in it.""" twitter_url_prefix = ["https://twitter.com/", "http://twitter.com/"] worksheet = spread.open_by_key(sheet_id).sheet1 col_index = ord(column_id) - ord("A") + 1 # Painfully slow, 2600 records = 3+ min. start = time.time() print("Fetching data from sheet {}".format(sheet_id)) twitter_urls = worksheet.col_values(col_index) print("Fetched everything in {} seconds".format(time.time() - start)) valid_handles = [] # Cell contents are URLs (possibly) pointing to a Twitter # Extract the Twitter handle from these urls if they exist for cell_content in twitter_urls: if not cell_content: continue # Twitter handle as it if "://" not in cell_content: valid_handles.append(cell_content.strip()) continue # One cell can contain multiple URLs, comma separated urls = [url.strip() for url in cell_content.split(",")] for url in urls: for prefix in twitter_url_prefix: if url.startswith(prefix): handle = url[len(prefix):] # Clean old style fragment URLs e.g #!/foobar if handle.startswith("#!/"): handle = handle[len("#!/"):] valid_handles.append(handle) return valid_handles def watch_files(http, title_match=None, folder_id=None) -> list: """Check all Google Drive files which match certain file pattern. Drive API: https://developers.google.com/drive/web/search-parameters :return: Iterable GDrive file list """ service = discovery.build('drive', 'v2', http=http) if folder_id: results = service.files().list(q="'{}' in parents".format(folder_id)).execute() elif title_match: results = service.files().list(q="title contains '{}'".format(title_match)).execute() else: raise RuntimeError("Unknown criteria") return results["items"] def scan_for_new_spreadsheets(http, db): """Check Google Drive for new spreadsheets. 1. Use Google Drive API to list all files matching our spreadsheet criteria 2. 
If the file is not seen before add it to our list of files to process """ # First discover new spreadsheets discovered = False for file in watch_files(http, folder_id="0BytechWnbrJVTlNqbGpWZllaYW8"): title = file["title"] last_char = title[-1] # It's .csv, photos, etc. misc files if not last_char.isdigit(): continue with transaction.manager: file_id = file["id"] if file_id not in db.files: print("Discovered file {}: {}".format(file["title"], file_id)) db.files[file_id] = PersistentMapping(file) discovered = True if not discovered: print("No new spreadsheets available") def extract_twitter_handles_from_spreadsheets(spread, db): """Extract new Twitter handles from spreadsheets. 1. Go through all spreadsheets we know. 2. If the spreadsheet is not marked as processed extract Twitter handles out of it 3. If any of the Twitter handles is unseen before add it to the database with empty record """ # Then extract Twitter handles from the files we know about for file_id, file_data in db.files.items(): spreadsheet_creation_date = iso8601.parse_date(file_data["createdDate"]) print("Processing {} created at {}".format(file_data["title"], spreadsheet_creation_date)) # Check the processing flag on the file if not file_data.get("processed"): handles = extract_twitter_handles(spread, file_id) # Using this transaction lock we write all the handles to the database once or none of them with transaction.manager: for handle in handles: # If we have not seen this if handle not in db.twitter_handles: print("Importing Twitter handle {}".format(handle)) db.twitter_handles[handle] = PersistentMapping({"added": spreadsheet_creation_date, "imported": datetime.datetime.utcnow(), "sheet": file_id}) file_data["processed"] = True def send_tweet(twitter, msg): """Send a Tweet. """ try: twitter.update_status(status=msg) except tweepy.error.TweepError as e: try: # {"errors":[{"code":187,"message":"Status is a duplicate."}]} resp = json.loads(e.response.text) if resp.get("errors"): if resp["errors"][0]["code"] == 187: print("Was duplicate {}".format(msg)) time.sleep(10 + random.randint(0, 10)) return except: pass raise RuntimeError("Twitter doesn't like us: {}".format(e.response.text or str(e))) from e # Throttle down the bot time.sleep(30 + random.randint(0, 90)) def tweet_everything(twitter, db): """Run through all users and check if we need to Tweet to them. 
""" tweet_count = 0 for handle_id, handle_data in db.twitter_handles.items(): with transaction.manager: # Check if we had not sent the first Tweet yet and send it if not handle_data.get("first_tweet_at"): tweet = "@{} {}".format(handle_id, random.choice(FIRST_TWEET_CHOICES)) print("Tweeting {} at {}".format(tweet, datetime.datetime.utcnow())) send_tweet(twitter, tweet) handle_data["first_tweet_at"] = datetime.datetime.utcnow() tweet_count += 1 # Check if we had not sent the first Tweet yet and send it elif not handle_data.get("second_tweet_at"): tweet = "@{} {}".format(handle_id, random.choice(SECOND_TWEET_CHOICES)) print("Tweeting {} at {}".format(tweet, datetime.datetime.utcnow())) send_tweet(twitter, tweet) handle_data["second_tweet_at"] = datetime.datetime.utcnow() tweet_count += 1 if tweet_count >= MAX_TWEET_COUNT: # Testing limiter - don't spam too much if our test run is out of control break def main(): script_name = sys.argv[1] if sys.argv[0] == "python" else sys.argv[0] print("Starting {} at {} UTC".format(script_name, datetime.datetime.utcnow())) # open database db = get_database() # get OAuth permissions from Google for Drive client and Spreadsheet client credentials = get_google_credentials() http = credentials.authorize(httplib2.Http()) spread = gspread.authorize(credentials) twitter = get_tweepy() # Do action scan_for_new_spreadsheets(http, db) extract_twitter_handles_from_spreadsheets(spread, db) tweet_everything(twitter, db) main()
tweepyauth.py
```python
import json
import webbrowser

import tweepy

"""
Query the user for their consumer key/secret
then attempt to fetch a valid access token.
"""

if __name__ == "__main__":

    consumer_key = input('Consumer key: ').strip()
    consumer_secret = input('Consumer secret: ').strip()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # Open authorization URL in browser
    webbrowser.open(auth.get_authorization_url())

    # Ask user for verifier pin
    pin = input('Verification pin number from twitter.com: ').strip()

    # Get access token
    access_token, access_token_secret = auth.get_access_token(verifier=pin)

    data = dict(
        consumer_key=consumer_key,
        consumer_secret=consumer_secret,
        access_token=access_token,
        access_token_secret=access_token_secret,
    )

    with open("twitter_oauth.json", "wt") as f:
        json.dump(data, f)
```