This blog post shows how to build a Twitter bot in Python, using Google Spreadsheets as the data source.
The service presented here was originally created for a friend of mine who works at Megacorp Inc. They have a marketing intelligence department that fills out stalked information about potential customers. This information is stored in Google Spreadsheets. Every day a new spreadsheet arrives in a folder. My friend then goes through all of the leads in the spreadsheet, checks who has a Twitter account, and harasses them on Twitter about Megacorp Inc. products.
To make my friend jobless I decided to replace his tedious workflow with a Python script. Python is a programming language for making simple tasks simple, eliminating the feeling of repeating yourself in as few lines as possible. So it is a good weapon of choice for crushing middle class labor force participation.
The bot sends two tweets to every Twitter user. The timing between the tweets and the content of the second tweet are randomized, just to make sure that no one can figure out in the blink of an eye that they are actually communicating with a bot.
The ingredients of this Twitter bot are:
- Python 3.4+ – a snake programming language loved by everyone
- gspread – a Python client for Google Spreadsheets making reading and manipulating data less painful
- tweepy – a Twitter client library for Python
- ZODB – an ACID compliant transactional database for native Python objects
The script is pretty much self-contained: around 200 lines of Python code and three hours of work.
1. Authenticating with third-party services
The bot uses the OAuth protocol to authenticate itself against Google services (Google Drive, Google Spreadsheets) and Twitter. In OAuth, you arrive at the service provider's web site through your normal web browser. If you are not yet logged in, the service asks you to log in. Then you get a page asking you to authorize the app. Twitter authentication is done in a separate script, tweepyauth.py, which asks you to enter the PIN shown on the Twitter website. The Google API client does things differently and spins up a local web server on a localhost port. When you authorize the app on the Google side, you are redirected back to this local web server and the script grabs the authentication token from there.
The script stores the authentication tokens in JSON files. You can run the script on your local computer first to generate the JSON files and then move everything to a server, where a web browser for the authentication dance may not be available.
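On the Twitter side, reading the cached tokens back and constructing an authenticated client takes only a few lines. Here is a minimal sketch of that pattern, mirroring the get_tweepy() function in the full source below (get_twitter_client is just an illustrative name):

```python
import json

import tweepy


def get_twitter_client(token_file="twitter_oauth.json"):
    # Load the tokens tweepyauth.py stored on disk earlier
    with open(token_file, "rt") as f:
        creds = json.load(f)

    auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
    auth.set_access_token(creds["access_token"], creds["access_token_secret"])
    return tweepy.API(auth)
```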
2. Maintaining persistent state
The bot needs to maintain state. It needs to process a new spreadsheet every day, but on some days the bot might not be running, so it needs to remember which spreadsheets it has already processed. Sometimes the spreadsheets contain duplicate entries of the same Twitter handle, and we don't want to harass that Twitter user over and over again. Some data cleaning is also applied to the column contents, as a cell might contain a raw Twitter handle or an HTTP or HTTPS URL to a Twitter user – those marketing intelligence people are not very strict about what they spill into their spreadsheets.
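The cleaning itself boils down to a few string operations. Here is a condensed sketch of what extract_twitter_handles() in the full source does to a single cell value (the full version also handles comma-separated lists of URLs in one cell; normalize_twitter_handle is just an illustrative name):

```python
def normalize_twitter_handle(cell_content):
    """Turn a raw handle or a twitter.com URL into a bare handle, or return None."""
    cell_content = cell_content.strip()
    if not cell_content:
        return None

    for prefix in ("https://twitter.com/", "http://twitter.com/"):
        if cell_content.startswith(prefix):
            cell_content = cell_content[len(prefix):]
            # Clean old style fragment URLs e.g. #!/foobar
            if cell_content.startswith("#!/"):
                cell_content = cell_content[len("#!/"):]

    return cell_content or None
```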
The state is maintained using ZODB. ZODB is a transactional database and very robust. It is mature – probably older than some readers of this blog post – with multigigabyte deployments running factories around the world. It runs in-process like SQLite and doesn't need any other software on the machine. It doesn't need an ORM, as it stores native Python objects: to make your application state persistent you just stick your Python objects into the ZODB root. Everything inside a transaction context manager is written to disk, or nothing is.
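To show how little ceremony this takes, here is a minimal self-contained sketch of the ZODB pattern the bot uses (the file name state.fs is arbitrary):

```python
import ZODB
import ZODB.FileStorage
import transaction
from persistent.mapping import PersistentMapping

# In-process file storage - no database server needed
storage = ZODB.FileStorage.FileStorage("state.fs")
db = ZODB.DB(storage)
connection = db.open()
root = connection.root

# Everything in this block is committed atomically - all or nothing
with transaction.manager:
    if not hasattr(root, "twitter_handles"):
        root.twitter_handles = PersistentMapping()
    root.twitter_handles["exampleuser"] = {"first_tweet_at": None}
```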
As a side note, using Google Spreadsheets over the REST API is painfully slow. If you need to process larger amounts of data, it might be more efficient to download the data locally as a CSV export and work from there.
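As a hedged sketch of that shortcut (not something this bot does): Drive v2 file resources for Google Sheets carry an exportLinks mapping, so you could pull the sheet down as CSV over the already-authorized httplib2 handle and parse it locally. download_sheet_as_csv is a hypothetical helper, not part of the bot:

```python
import csv
import io


def download_sheet_as_csv(http, file_item):
    # file_item is a Drive v2 file resource, e.g. one returned by watch_files() below;
    # the text/csv export covers the first worksheet only
    url = file_item["exportLinks"]["text/csv"]
    response, content = http.request(url)
    return list(csv.reader(io.StringIO(content.decode("utf-8"))))
```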
3. Usage instructions
This code is exemplary. You can't run it as-is, because you do not have the correct data or access to it. Use it to inspire your imagination. If you were to use it, it would happen like this:
- Follow the official Python package installation guide to install the dependencies (the exact pip command is given after these steps).
- Register a Twitter app. Leave the callback URL empty.
- Register a Google services App.
- Run tweepyauth.py to get Twitter tokens stored in twitter_oauth.json.
- Run the bot once on your local computer and authenticate it against Google services, using the client_secrets.json you downloaded when registering the app; the resulting token is stored in oauth_authorization.json.
- See that the bot starts working.
- Move the whole thing to a server.
- Leave it running in a loop at a Bash prompt forever, e.g.: $ while true; do python chirper.py; sleep 3600; done
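The pip command, as given in the chirper.py docstring below:

```
pip install --upgrade oauth2client gspread google-api-python-client ZODB zodbpickle tweepy iso8601
```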
4. Source code
chirper.py
""" Installation: pip install --upgrade oauth2client gspread google-api-python-client ZODB zodbpickle tweepy iso8601 """ import time import datetime import json import httplib2 import os import sys # Authorize server-to-server interactions from Google Compute Engine. from apiclient import discovery import oauth2client from oauth2client import client from oauth2client import tools # ZODB import ZODB import ZODB.FileStorage import BTrees.OOBTree from persistent.mapping import PersistentMapping import random import transaction # Date parsing import iso8601 # https://github.com/burnash/gspread import gspread # Twitter client import tweepy try: import argparse flags = argparse.ArgumentParser(parents=[tools.argparser]).parse_args() except ImportError: flags = None # We need permissions to drive list files, drive read files, spreadsheet manipulation SCOPES = ['https://www.googleapis.com/auth/devstorage.read_write', 'https://www.googleapis.com/auth/drive.metadata.readonly', 'https://spreadsheets.google.com/feeds'] CLIENT_SECRET_FILE = 'client_secrets.json' APPLICATION_NAME = 'MEGACORP SPREADSHEET SCRAPER BOT' OAUTH_DATABASE = "oauth_authorization.json" FIRST_TWEET_CHOICES = [ "WE AT MEGACORP THINK YOU MIGHT LIKE US - http://megacorp.example.com", ] SECOND_TWEET_CHOICES = [ "AS WELL, WE ARE PROBABLY CHEAPER THAN COMPETITORCORP INC. http://megacorp.example.com/prices", "AS WELL, OUR FEATURE SET IS LONGER THAN MISSISSIPPI http://megacorp.example.com/features", "AS WELL, OUR CEO IS VERY HANDSOME http://megacorp.example.com/team", ] # Make sure our text is edited correctly for tweet in FIRST_TWEET_CHOICES + SECOND_TWEET_CHOICES: assert len(tweet) < 140 # How many tweets can be send in one run... limit for testing / debugging MAX_TWEET_COUNT = 10 # https://developers.google.com/drive/web/quickstart/python def get_google_credentials(): """Gets valid user credentials from storage. If nothing has been stored, or if the stored credentials are invalid, the OAuth2 flow is completed to obtain the new credentials. Returns: Credentials, the obtained credential. 
""" credential_path = os.path.join(os.getcwd(), OAUTH_DATABASE) store = oauth2client.file.Storage(credential_path) credentials = store.get() if not credentials or credentials.invalid: flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES) flow.user_agent = APPLICATION_NAME if flags: credentials = tools.run_flow(flow, store, flags) else: # Needed only for compatability with Python 2.6 credentials = tools.run(flow, store) print('Storing credentials to ' + credential_path) return credentials def get_tweepy(): """Create a Tweepy client instance.""" creds = json.load(open("twitter_oauth.json", "rt")) auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"]) auth.set_access_token(creds["access_token"], creds["access_token_secret"]) api = tweepy.API(auth) return api def get_database(): """Get or create a ZODB database where we store information about processed spreadsheets and sent tweets.""" storage = ZODB.FileStorage.FileStorage('chirper.data.fs') db = ZODB.DB(storage) connection = db.open() root = connection.root # Initialize root data structure if not present yet with transaction.manager: if not hasattr(root, "files"): root.files = BTrees.OOBTree.BTree() if not hasattr(root, "twitter_handles"): # Format of {added: datetime, imported: datetime, sheet: str, first_tweet_at: datetime, second_tweet_at: datetime} root.twitter_handles = BTrees.OOBTree.BTree() return root def extract_twitter_handles(spread, sheet_id, column_id="L"): """Process one spreadsheet and return Twitter handles in it.""" twitter_url_prefix = ["https://twitter.com/", "http://twitter.com/"] worksheet = spread.open_by_key(sheet_id).sheet1 col_index = ord(column_id) - ord("A") + 1 # Painfully slow, 2600 records = 3+ min. start = time.time() print("Fetching data from sheet {}".format(sheet_id)) twitter_urls = worksheet.col_values(col_index) print("Fetched everything in {} seconds".format(time.time() - start)) valid_handles = [] # Cell contents are URLs (possibly) pointing to a Twitter # Extract the Twitter handle from these urls if they exist for cell_content in twitter_urls: if not cell_content: continue # Twitter handle as it if "://" not in cell_content: valid_handles.append(cell_content.strip()) continue # One cell can contain multiple URLs, comma separated urls = [url.strip() for url in cell_content.split(",")] for url in urls: for prefix in twitter_url_prefix: if url.startswith(prefix): handle = url[len(prefix):] # Clean old style fragment URLs e.g #!/foobar if handle.startswith("#!/"): handle = handle[len("#!/"):] valid_handles.append(handle) return valid_handles def watch_files(http, title_match=None, folder_id=None) -> list: """Check all Google Drive files which match certain file pattern. Drive API: https://developers.google.com/drive/web/search-parameters :return: Iterable GDrive file list """ service = discovery.build('drive', 'v2', http=http) if folder_id: results = service.files().list(q="'{}' in parents".format(folder_id)).execute() elif title_match: results = service.files().list(q="title contains '{}'".format(title_match)).execute() else: raise RuntimeError("Unknown criteria") return results["items"] def scan_for_new_spreadsheets(http, db): """Check Google Drive for new spreadsheets. 1. Use Google Drive API to list all files matching our spreadsheet criteria 2. 
If the file is not seen before add it to our list of files to process """ # First discover new spreadsheets discovered = False for file in watch_files(http, folder_id="0BytechWnbrJVTlNqbGpWZllaYW8"): title = file["title"] last_char = title[-1] # It's .csv, photos, etc. misc files if not last_char.isdigit(): continue with transaction.manager: file_id = file["id"] if file_id not in db.files: print("Discovered file {}: {}".format(file["title"], file_id)) db.files[file_id] = PersistentMapping(file) discovered = True if not discovered: print("No new spreadsheets available") def extract_twitter_handles_from_spreadsheets(spread, db): """Extract new Twitter handles from spreadsheets. 1. Go through all spreadsheets we know. 2. If the spreadsheet is not marked as processed extract Twitter handles out of it 3. If any of the Twitter handles is unseen before add it to the database with empty record """ # Then extract Twitter handles from the files we know about for file_id, file_data in db.files.items(): spreadsheet_creation_date = iso8601.parse_date(file_data["createdDate"]) print("Processing {} created at {}".format(file_data["title"], spreadsheet_creation_date)) # Check the processing flag on the file if not file_data.get("processed"): handles = extract_twitter_handles(spread, file_id) # Using this transaction lock we write all the handles to the database once or none of them with transaction.manager: for handle in handles: # If we have not seen this if handle not in db.twitter_handles: print("Importing Twitter handle {}".format(handle)) db.twitter_handles[handle] = PersistentMapping({"added": spreadsheet_creation_date, "imported": datetime.datetime.utcnow(), "sheet": file_id}) file_data["processed"] = True def send_tweet(twitter, msg): """Send a Tweet. """ try: twitter.update_status(status=msg) except tweepy.error.TweepError as e: try: # {"errors":[{"code":187,"message":"Status is a duplicate."}]} resp = json.loads(e.response.text) if resp.get("errors"): if resp["errors"][0]["code"] == 187: print("Was duplicate {}".format(msg)) time.sleep(10 + random.randint(0, 10)) return except: pass raise RuntimeError("Twitter doesn't like us: {}".format(e.response.text or str(e))) from e # Throttle down the bot time.sleep(30 + random.randint(0, 90)) def tweet_everything(twitter, db): """Run through all users and check if we need to Tweet to them. 
""" tweet_count = 0 for handle_id, handle_data in db.twitter_handles.items(): with transaction.manager: # Check if we had not sent the first Tweet yet and send it if not handle_data.get("first_tweet_at"): tweet = "@{} {}".format(handle_id, random.choice(FIRST_TWEET_CHOICES)) print("Tweeting {} at {}".format(tweet, datetime.datetime.utcnow())) send_tweet(twitter, tweet) handle_data["first_tweet_at"] = datetime.datetime.utcnow() tweet_count += 1 # Check if we had not sent the first Tweet yet and send it elif not handle_data.get("second_tweet_at"): tweet = "@{} {}".format(handle_id, random.choice(SECOND_TWEET_CHOICES)) print("Tweeting {} at {}".format(tweet, datetime.datetime.utcnow())) send_tweet(twitter, tweet) handle_data["second_tweet_at"] = datetime.datetime.utcnow() tweet_count += 1 if tweet_count >= MAX_TWEET_COUNT: # Testing limiter - don't spam too much if our test run is out of control break def main(): script_name = sys.argv[1] if sys.argv[0] == "python" else sys.argv[0] print("Starting {} at {} UTC".format(script_name, datetime.datetime.utcnow())) # open database db = get_database() # get OAuth permissions from Google for Drive client and Spreadsheet client credentials = get_google_credentials() http = credentials.authorize(httplib2.Http()) spread = gspread.authorize(credentials) twitter = get_tweepy() # Do action scan_for_new_spreadsheets(http, db) extract_twitter_handles_from_spreadsheets(spread, db) tweet_everything(twitter, db) main()
tweepyauth.py
```python
import json
import webbrowser

import tweepy

"""
Query the user for their consumer key/secret
then attempt to fetch a valid access token.
"""

if __name__ == "__main__":

    consumer_key = input('Consumer key: ').strip()
    consumer_secret = input('Consumer secret: ').strip()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

    # Open authorization URL in browser
    webbrowser.open(auth.get_authorization_url())

    # Ask user for verifier pin
    pin = input('Verification pin number from twitter.com: ').strip()

    # Get access token
    access_token, access_token_secret = auth.get_access_token(verifier=pin)

    data = dict(
        consumer_key=consumer_key,
        consumer_secret=consumer_secret,
        access_token=access_token,
        access_token_secret=access_token_secret,
    )

    with open("twitter_oauth.json", "wt") as f:
        json.dump(data, f)
```