27978 html parsing in python


In progress

Posted almost 18 years ago


Paid on delivery

For now, I need HTML parsers in Python. I'm collecting data from many different websites and storing it in a ZODB database. The project is still in the data collection phase, as there are 50+ sites that need to be parsed. The sites are sports betting websites, and the collected data is used to analyze the betting odds of various sports events. The following is a spec, with an attached file as an example of a properly parsed site.

*****

Each parser will extract info from sports betting sites. The extracted data is supposed to fill several instances of the Bet class:

    class Bet(object):
        def __init__(self):
            [login to view URL] = ''
            [login to view URL] = ''
            [login to view URL] = -1
            [login to view URL] = ''
            [login to view URL] = -1
            [login to view URL] = ''
            [login to view URL] = None
            [login to view URL] = ''
            [login to view URL] = ''
            [login to view URL] = ''
            [login to view URL] = ''
            [login to view URL] = 0.0
            [login to view URL] = 0.0
            [login to view URL] = 0.0
            [login to view URL] = 2
            [login to view URL] = 1000.0

The attributes:

[login to view URL] is the site the parsed data comes from.
[login to view URL], [login to view URL] are the participants. The away team is the visiting one: in a line like "NY Knicks at Chicago Bulls" (NY Knicks plays at Chicago Bulls), NY Knicks is the awayTeam and Chicago Bulls the homeTeam.
[login to view URL] and [login to view URL] are calculated beforehand and should be ignored.
[login to view URL] can be either spread, over/under or moneyline.
[login to view URL] is the game date. It should be a [login to view URL] instance.
[login to view URL] is the sport name, in lowercase.
[login to view URL] is the league name, when available, in lowercase.
[login to view URL] is the gamepart name ('game', '1st quarter', 'half', etc.), in lowercase.
[login to view URL] is the bet side, either 'home', 'away' or 'draw'.
[login to view URL], [login to view URL] and [login to view URL] are float values with the odds for each bet.
Which one is available (!= 0.0) depends on the betType.
[login to view URL] refers to 2-line or 3-line bets. On 2-line bets, you get your money back in case of a draw. On 3-line bets, you lose your money on a draw, but you can also bet on the draw, so the parser should generate a Bet object for the draw too.
[login to view URL] is the maximum bet value allowed. The default value is 1000.00.

The parsers should follow these specs:

A parser module should have Connection and BetFactory classes with the following interface:

    import urllib2

    class Connection(object):
        def __init__(self):
            [login to view URL] = False
            self.build_opener()

        def reset(self):
            [login to view URL]()
            [login to view URL] = False
            self.build_opener()

        def close(self):
            [login to view URL]()
            [login to view URL] = False

        def build_opener(self):
            [login to view URL] = urllib2.build_opener()

        def login(self):
            raise NotImplementedError

        def get_data(self, sports=SPORTS):
            raise NotImplementedError

    class BetFactory(object):
        def extract_bets(self, data, sports=SPORTS):
            raise NotImplementedError

Cookie management and everything else related to the connection should be done in the Connection.build_opener() and [login to view URL]() methods. It's OK to have some parsing code for the login page in [login to view URL](), but parsing and networking code should otherwise be completely isolated.

Connection.get_data() should connect to the site, extract all the data needed, and return it in the format expected by BetFactory.extract_bets(). A single string is preferred, but any format is OK as long as it's exactly what BetFactory.extract_bets() expects.

BetFactory should create all the HTMLParser subclass instances needed and use them to parse the raw data. BetFactory.extract_bets() should return a sequence (or a generator if possible) with a Bet instance for each line parsed.

HTML parsing should be done using the Python standard library HTMLParser module, unless a page's HTML is buggy and something else is needed.
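To make the BetFactory and HTMLParser requirements above more concrete, here is a rough sketch of what one parser module might look like. It is only an illustration under stated assumptions: the page layout (one table row per game with away team, home team, away odds, home odds and an optional draw price) is invented, OddsTableParser is a hypothetical class name, and the Bet class is a cut-down stand-in in which only homeTeam, awayTeam and betType are names taken from the spec text; side, odds and lines are made-up attribute names. The spec's sports=SPORTS default is replaced by sports=None because SPORTS is not defined in this posting.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module named in the spec
except ImportError:
    from html.parser import HTMLParser  # same class on Python 3


class Bet(object):
    """Cut-down stand-in for the spec's Bet class (see assumptions above)."""
    def __init__(self):
        self.homeTeam = ''
        self.awayTeam = ''
        self.betType = ''
        self.side = ''    # hypothetical name: 'home', 'away' or 'draw'
        self.odds = 0.0   # hypothetical name for the moneyline price
        self.lines = 2    # hypothetical name: 2-line or 3-line bet


class OddsTableParser(HTMLParser):
    """Collects the text of every <td>, grouped by <tr> (assumed layout)."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_cell = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data


class BetFactory(object):
    def extract_bets(self, data, sports=None):
        """Yield one Bet per price found in the raw HTML string `data`."""
        parser = OddsTableParser()
        parser.feed(data)
        parser.close()
        for row in parser.rows:
            if len(row) not in (4, 5):
                continue  # header row or unexpected layout
            three_way = len(row) == 5
            sides = ['away', 'home'] + (['draw'] if three_way else [])
            for side, price in zip(sides, row[2:]):
                bet = Bet()
                bet.awayTeam, bet.homeTeam = row[0], row[1]
                bet.betType = 'moneyline'
                bet.side = side
                bet.odds = float(price)
                bet.lines = 3 if three_way else 2
                yield bet
```

Because extract_bets() is a generator that only takes a string, it can be exercised against a saved copy of a page without any networking code, which keeps parsing and connection handling isolated as the spec asks.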
HTTP access should be done using the urllib2, urllib, and clientcookie modules. External modules should be avoided as much as possible.

Some styling rules:
each HTML page should have its own parser, implemented in a [login to view URL] subclass
don't use from x import *

Paying anywhere from $30-$70 a site, depending on the site's difficulty and how fast the code can be written.
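As an illustration of the networking side of the spec (cookie handling confined to build_opener()), a Connection base class might be sketched as below. This is an assumption-laden sketch, not the client's actual code: logged_in, cookies and opener are hypothetical attribute names, the standard library cookie jar (cookielib on Python 2, http.cookiejar on Python 3) stands in for the clientcookie module named in the spec, ExampleSiteConnection and its URL are invented, and sports=None replaces the undefined SPORTS default.

```python
try:                                    # Python 2 names, as in the spec
    import urllib2 as request
    import cookielib as cookiejar
except ImportError:                     # Python 3 equivalents
    import urllib.request as request
    import http.cookiejar as cookiejar


class Connection(object):
    def __init__(self):
        self.logged_in = False  # hypothetical attribute name
        self.build_opener()

    def build_opener(self):
        # All cookie management is set up here and nowhere else.
        self.cookies = cookiejar.CookieJar()
        self.opener = request.build_opener(
            request.HTTPCookieProcessor(self.cookies))

    def reset(self):
        self.close()
        self.build_opener()

    def close(self):
        self.cookies.clear()    # drop any session cookies
        self.logged_in = False

    def login(self):
        raise NotImplementedError

    def get_data(self, sports=None):
        raise NotImplementedError


class ExampleSiteConnection(Connection):
    """Hypothetical per-site subclass; the URL is a placeholder."""
    URL = 'http://example.com/odds'

    def login(self):
        self.logged_in = True   # a real site would post credentials here

    def get_data(self, sports=None):
        # Return the page as a single string for BetFactory.extract_bets().
        response = self.opener.open(self.URL)
        try:
            return response.read().decode('utf-8', 'replace')
        finally:
            response.close()
```

Each site would then get its own subclass that fills in login() and get_data(), while the shared reset() and close() logic stays in the base class.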
