For now, I need HTML parsers written in Python. I'm collecting data from many different sports betting websites and storing it in a ZODB database. The project is still in the data collection phase, as there are 50+ sites that need to be parsed. The collected data is used to analyze the betting odds of various sports events.
The following is the spec, with an attached file as an example of a properly parsed site.
*****
Each parser will extract info from sports betting sites. Data extracted is supposed to fill several instances of the Bet class:
class Bet(object):
    def __init__(self):
        [login to view URL] = ''
        [login to view URL] = ''
        [login to view URL] = -1
        [login to view URL] = ''
        [login to view URL] = -1
        [login to view URL] = ''
        [login to view URL] = None
        [login to view URL] = ''
        [login to view URL] = ''
        [login to view URL] = ''
        [login to view URL] = ''
        [login to view URL] = 0.0
        [login to view URL] = 0.0
        [login to view URL] = 0.0
        [login to view URL] = 2
        [login to view URL] = 1000.0
The attributes:
[login to view URL] is the site the parsed data comes from.
[login to view URL], [login to view URL] are the participants. The away team is the visiting one: in a line like "NY Knicks at Chicago Bulls" (NY Knicks plays at Chicago Bulls), NY Knicks is the awayTeam and Chicago Bulls the homeTeam.
[login to view URL] and [login to view URL] are calculated before and should be ignored.
[login to view URL] can be either spread, over/under or moneyline.
[login to view URL] is the game date. Should be a [login to view URL] instance.
[login to view URL] is the sport name, in lowercase.
[login to view URL] is the league name, when available, in lowercase.
[login to view URL] is the gamepart name ('game', '1st quarter', 'half', etc.), in lowercase.
[login to view URL] is the bet side, either 'home', 'away' or 'draw'.
[login to view URL], [login to view URL] and [login to view URL] are float values with the odds for each bet. Which one is available (!= 0.0) depends on the betType.
[login to view URL] refers to 2-line or 3-line bets. On 2-line bets, you get your money back in case of a draw. On 3-line bets, you lose your money on a draw, but you can also bet on the draw, so the parser should generate a Bet object for the draw as well.
[login to view URL] is the max bet value allowed. The default value is 1000.0.
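To illustrate the 2-line versus 3-line rule, here is a minimal sketch of how one parsed moneyline could be expanded into one record per bet side. BetRecord, expand_moneyline and all field names are hypothetical stand-ins, since the real Bet attribute names are not visible in this copy of the spec; treating a market with a draw price as 3-line is also an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for the spec's Bet class. The field names are
# assumptions; the real attribute names are not visible in this posting.
BetRecord = namedtuple("BetRecord", "home_team away_team side odds lines")


def expand_moneyline(home_team, away_team, odds_by_side):
    """Yield one record per bet side present in odds_by_side.

    odds_by_side maps 'home', 'away' and optionally 'draw' to odds.
    A market that offers a 'draw' price is treated as a 3-line bet,
    anything else as a 2-line bet.
    """
    lines = 3 if "draw" in odds_by_side else 2
    for side in ("home", "away", "draw"):
        if side in odds_by_side:
            yield BetRecord(home_team, away_team, side,
                            float(odds_by_side[side]), lines)


# Illustrative odds only, not taken from any real site.
three_way = list(expand_moneyline("Chicago Bulls", "NY Knicks",
                                  {"home": 1.8, "away": 2.1, "draw": 15.0}))
two_way = list(expand_moneyline("Chicago Bulls", "NY Knicks",
                                {"home": 1.8, "away": 2.1}))
```

With a draw price present, three records come out (home, away, draw); without it, only the home and away records are produced.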
The parsers should follow these specs:
A parser module should have Connection and BetFactory classes, with
the following interface:
class Connection(object):
    def __init__(self):
        [login to view URL] = False
        self.build_opener()
    def reset(self):
        [login to view URL]()
        [login to view URL] = False
        self.build_opener()
    def close(self):
        [login to view URL]()
        [login to view URL] = False
    def build_opener(self):
        [login to view URL] = urllib2.build_opener()
    def login(self):
        raise NotImplementedError
    def get_data(self, sports=SPORTS):
        raise NotImplementedError

class BetFactory(object):
    def extract_bets(self, data, sports=SPORTS):
        raise NotImplementedError
Cookie management and everything else related to the connection should
be done in the Connection.build_opener() and [login to view URL]() methods. It's
OK to have some parsing code for the login page in [login to view URL](), but
otherwise parsing and networking code should be completely isolated.
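As a rough sketch of keeping cookie handling inside build_opener(), the class below attaches a cookie jar to the opener. It swaps in the standard library cookie jar (cookielib on Python 2, http.cookiejar on Python 3) rather than the clientcookie module the spec mentions, and the names CookieConnection, logged_in, cookies and opener are assumptions made for illustration.

```python
try:  # Python 3 module names
    from http.cookiejar import CookieJar
    from urllib.request import HTTPCookieProcessor, build_opener
except ImportError:  # Python 2 names, matching the urllib2 usage in the spec
    from cookielib import CookieJar
    from urllib2 import HTTPCookieProcessor, build_opener


class CookieConnection(object):
    """Hypothetical Connection variant that keeps session cookies."""

    def __init__(self):
        self.logged_in = False  # assumed attribute name
        self.build_opener()

    def build_opener(self):
        # All cookie handling lives here, away from any parsing code.
        self.cookies = CookieJar()
        self.opener = build_opener(HTTPCookieProcessor(self.cookies))

    def reset(self):
        # Forget the old session and start again with an empty cookie jar.
        self.logged_in = False
        self.build_opener()


conn = CookieConnection()
```

Requests made through conn.opener.open(...) would then send and store cookies automatically; no network access happens until that call is made.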
Connection.get_data() should connect to the site, extract all the data
needed, and return it in the format expected by
BetFactory.extract_bets(). A single string is preferred, but any
format is OK as long as it's exactly what BetFactory.extract_bets()
expects. BetFactory should create all the HTMLParser subclass
instances needed and use them to parse the raw data.
BetFactory.extract_bets() should return a sequence (or a generator if
possible) with a Bet instance for each line parsed.
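A minimal sketch of that flow: a factory creates an HTMLParser subclass, feeds it the raw string, and yields one record per bet from a generator. The table layout, OddsTableParser, SketchBetFactory and the 'side' and 'odds' keys are all assumptions; plain dicts stand in for Bet instances, and the sports argument from the interface is left out for brevity.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as named in the spec


class OddsTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


class SketchBetFactory(object):
    def extract_bets(self, data):
        """Yield one dict per bet; dicts stand in for Bet instances."""
        parser = OddsTableParser()
        parser.feed(data)
        parser.close()
        for row in parser.rows:
            if len(row) != 4:
                continue  # skip rows that do not match the assumed layout
            away, home, away_odds, home_odds = row
            yield {"awayTeam": away, "homeTeam": home,
                   "side": "away", "odds": float(away_odds)}
            yield {"awayTeam": away, "homeTeam": home,
                   "side": "home", "odds": float(home_odds)}


# Made-up page fragment; real sites will need their own parser.
SAMPLE = ("<table><tr><td>NY Knicks</td><td>Chicago Bulls</td>"
          "<td>2.10</td><td>1.80</td></tr></table>")
bets = list(SketchBetFactory().extract_bets(SAMPLE))
```

Because extract_bets() is a generator, the caller can store bets as they are produced instead of waiting for the whole page to be processed.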
HTML parsing should be done using the Python standard library
HTMLParser module, unless a page has buggy HTML and something else is
needed. HTTP access should be done using the urllib2, urllib, and
clientcookie modules. External modules should be avoided as much as
possible.
Some styling rules: each HTML page should have its own parser,
implemented in a [login to view URL] subclass. Don't use from x import *.
I'm paying anywhere from $30 to $70 per site, depending on the site's difficulty and how fast the code can be written.