Beginner's Guide to Programming - guidetoprogramming.com

Get data from a web page

This script is useful for gathering data from a web page.  It will not work in Windows!  It will work in Linux (I haven't tested any other operating systems).

While there are multiple methods for grabbing data from the web in Python, I have found this to be the simplest and most useful, and it delivers the easiest output to read.  It actually calls a Linux shell command that invokes a text browser (w3m), uses it to grab the page, and dumps the translated data to a file.  It doesn't save the HTML; rather, it saves formatted text.  The data isn't necessarily as clean as it would be if you were using IE, Firefox or Chrome, but it lends itself easily to viewing or further programming to extract specific data.
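Because the script depends on the w3m program being installed, it can help to check for it before running anything. Here is a minimal sketch, assuming Python 3.3 or later (needed for shutil.which); the function name w3m_available is made up for illustration and is not part of the script.

```python
import shutil

def w3m_available(program="w3m"):
    """Return True if the named program can be found on the PATH."""
    # shutil.which returns the full path to the program, or None if it is missing
    return shutil.which(program) is not None

if not w3m_available():
    print("w3m was not found; install it before running the script")
```

On most Linux distributions w3m can be installed from the package manager if the check fails.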

 


#!/usr/bin/python

#this calls a webpage and dumps a screen scrape to a file
#usage webget(url, filename)


import subprocess

#webget is defined as a function; usage is "webget(site_address, output_file)"
def webget(site, output):
    #the w3m text browser is called to render the page as plain text
    #arguments are passed as a list, so no shell is involved and the URL needs no quoting
    command = ['w3m', '-dump', site]
    #stdout is sent straight to the output file, so no shell redirect ('>') is needed
    with open(output, "w") as outfile:
        subprocess.call(command, stdout=outfile)

#now we call our function with the variables we want
webget('http://www.yahoo.com', 'test.txt')
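
Once the page has been dumped to a text file, the "further programming to extract specific data" can be as simple as scanning the file for lines of interest. The following is only a sketch of one possible approach: the function name find_lines and the keyword search are assumptions for illustration, not part of the original script.

```python
def find_lines(path, keyword):
    """Return the lines of the text file at path that contain keyword (case-insensitive)."""
    matches = []
    with open(path) as dumped:
        for line in dumped:
            # compare in lower case so 'News' and 'news' both match
            if keyword.lower() in line.lower():
                matches.append(line.strip())
    return matches
```

For example, calling find_lines('test.txt', 'news') after running webget would return every line of the dumped page that mentions "news". For anything more structured than a keyword search, a regular expression or a proper HTML parser would probably be a better fit.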