Parsing Wikipedia table with Python

Submitted by: @import:stackexchange-codereview

Problem

I am new to Python and recently started exploring web crawling. The code below parses the S&P 500 List Wikipedia page and writes the data of a specific table into a database.

While this script is hardcoded and I would certainly be interested in some thoughts on performing the same task in a slightly more generic way (perhaps with beautifulsoup), this is not my primary concern. What I really wondered was if there is a less verbose or more "pythonic" way of doing it.

```
import urllib.request
import re
import pymysql

# Open Website and get only the table on the page with the relevant data. In this hardcoded case
table = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies#S.26P_500_Component_Stocks").read().decode("utf-8")
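# ... (the statements that cut the page down to the lines of the table were lost in extraction)

# Define regex used for parsing
tick_ident_nasdaq = 'href=\"http:\/\/www\.nasdaq\.com\/symbol\/'
tick_ident_nyse = 'href=\"https:\/\/www.nyse.com\/quote\/'
name_grab = '\">(.+)<\/a></td>'
cigs_grab = '^<td>(.+)</td>'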

ticker, exchange, names, cigs, cigs_sub = ( [] for i in range(5))
match = False

# Parse HTML output and write relevant td data to lists.
# The list is "hardcoded", meaning after each match of either NASDAQ or NYSE ident,
# the matching as well as the next, the fourth and fifth after that one get parsed.

for i in range(len(table)):
    if bool(re.search(pattern = tick_ident_nasdaq, string = table[i])):
        ticker.append(re.search(pattern = name_grab, string = table[i]).group(1))
        exchange.append("NASDAQ")
        match = True
    elif bool(re.search(pattern = tick_ident_nyse, string = table[i])):
        ticker.append(re.search(pattern = name_grab, string = table[i]).group(1))
        exchange.append("NYSE")
        match = True

    if match == True:
        names.append(re.search(pattern = name_grab, string = table[i + 1]).group(1))
        names[-1] = re.sub(pattern = "&amp;", repl = "&", string = names[-1])
        cigs.append(re.search(pattern = cigs_grab, string = table[i + 3]).group(1))
        cigs[-1] = re.sub(pattern = "&amp;", repl = "&", string = cigs[-1])
        cigs_sub.append(re.search(pattern = cigs_grab, string = table[i + 4]).group(1))
        # ... (the rest of the script is truncated in the source)
```

Solution

The right tool

As you suspected, regular expressions are not the right tool for this task: they cannot parse HTML reliably, and patterns tied to the exact markup break as soon as the page layout changes.

A better approach would be to use an already existing parser like BeautifulSoup.
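To show what "using a parser" looks like without requiring a third-party install, here is a minimal sketch built on the standard library's `html.parser` module, swapped in for BeautifulSoup. The sample markup, the class name `TableRowParser`, and its attributes are assumptions made for illustration; the real page has more columns and may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, shaped like two rows of the table in the question.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Ticker symbol</th><th>Security</th></tr>
<tr>
<td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td>
<td><a href="/wiki/3M">3M Company</a></td>
</tr>
<tr>
<td><a href="https://www.nyse.com/quote/XNYS:T">T</a></td>
<td><a href="/wiki/AT%26T">AT&amp;T Inc</a></td>
</tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # header rows hold only <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
```

Because the parser tracks tags rather than line positions, the rows stay correct even if the markup is reflowed onto different lines, which is exactly where the `table[i + 1]`, `table[i + 3]` indexing would fail.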

A simpler container

At the moment, you put the data in multiple lists and zip them together at the very end. That can be a nice technique, but here it splits things that belong together across different containers. You also risk adding one element too many to a list, so that values get zipped with values that belong to a different row. An easier option is a single list in which each element contains everything you parsed for one company.

You can also take this chance to rewrite, in a more straightforward way, the parts where you append something to a list and then refer to it with my_list[-1].

```python
company_data = []

for i in range(len(table)):
    if bool(re.search(pattern = tick_ident_nasdaq, string = table[i])):
        exchange = "NASDAQ"
    elif bool(re.search(pattern = tick_ident_nyse, string = table[i])):
        exchange = "NYSE"
    else:
        exchange = None
    if exchange:
        ticker = re.search(pattern = name_grab, string = table[i]).group(1)
        name = re.search(pattern = name_grab, string = table[i + 1]).group(1)
        name = re.sub(pattern = "&amp;", repl = "&", string = name)
        cig = re.search(pattern = cigs_grab, string = table[i + 3]).group(1)
        cig = re.sub(pattern = "&amp;", repl = "&", string = cig)
        cig_sub = re.search(pattern = cigs_grab, string = table[i + 4]).group(1)
        cig_sub = re.sub(pattern = "&amp;", repl = "&", string = cig_sub)
        company_data.append((ticker, exchange, name, cig, cig_sub))
```
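The misalignment risk with parallel lists can be shown in a few lines. This is a minimal sketch with illustrative sample rows; the `Company` record type and its field names are hypothetical and not part of the original script.

```python
from typing import NamedTuple


class Company(NamedTuple):
    """Hypothetical record holding the fields of one table row."""
    ticker: str
    exchange: str
    name: str


# Parallel lists: one missed append shifts every later row.
tickers = ["MMM", "ABT", "ACN"]
names = ["3M Company", "Accenture plc"]  # the name for ABT was never appended
misaligned = list(zip(tickers, names))
# zip stops at the shorter list: ABT is paired with the wrong name
# and ACN is dropped without any error.
print(misaligned)

# One record per row: the fields of a company cannot drift apart.
companies = [
    Company("MMM", "NYSE", "3M Company"),
    Company("ACN", "NYSE", "Accenture plc"),
]
by_ticker = {company.ticker: company for company in companies}
print(by_ticker["ACN"].name)
```

A named tuple still behaves like the plain tuples appended to `company_data`, so it should work anywhere a sequence of values is expected, while giving each field a readable name.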


Compile your regexp

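If you plan to reuse a regular expression many times, compile it once with re.compile. This saves the per-call pattern lookup, and the compiled pattern is an ordinary Python object with its own search and sub methods, so you can store it and pass it around like any other value.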

```python
# Define regex used for parsing
tick_ident_nasdaq = re.compile('href=\"http:\/\/www\.nasdaq\.com\/symbol\/')
tick_ident_nyse = re.compile('href=\"https:\/\/www.nyse.com\/quote\/')
name_grab = re.compile('\">(.+)<\/a></td>')
cigs_grab = re.compile('^<td>(.+)</td>')
amp_re = re.compile("&amp;")

company_data = []

for i in range(len(table)):
    if bool(tick_ident_nasdaq.search(string = table[i])):
        exchange = "NASDAQ"
    elif bool(tick_ident_nyse.search(string = table[i])):
        exchange = "NYSE"
    else:
        exchange = None
    if exchange:
        ticker = name_grab.search(string = table[i]).group(1)
        name = name_grab.search(string = table[i + 1]).group(1)
        name = amp_re.sub(repl = "&", string = name)
        cig = cigs_grab.search(string = table[i + 3]).group(1)
        cig = amp_re.sub(repl = "&", string = cig)
        cig_sub = cigs_grab.search(string = table[i + 4]).group(1)
        cig_sub = amp_re.sub(repl = "&", string = cig_sub)
        company_data.append((ticker, exchange, name, cig, cig_sub))
```
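Since compiled patterns are ordinary objects, one possible next step is to keep them in a dictionary and replace the if/elif chain with a loop. The sketch below is an assumption about how that could look: the names `EXCHANGE_PATTERNS` and `detect_exchange` are hypothetical, and the patterns are rewritten as raw strings with the dots in the NYSE host name escaped, which differs slightly from the patterns above.

```python
import re

# Compiled patterns stored like any other value, keyed by exchange name.
EXCHANGE_PATTERNS = {
    "NASDAQ": re.compile(r'href="http://www\.nasdaq\.com/symbol/'),
    "NYSE": re.compile(r'href="https://www\.nyse\.com/quote/'),
}


def detect_exchange(line):
    """Return the exchange whose link pattern occurs in line, else None."""
    for exchange, pattern in EXCHANGE_PATTERNS.items():
        if pattern.search(line):
            return exchange
    return None


# Hypothetical sample lines in the shape the patterns expect.
nasdaq_line = '<td><a href="http://www.nasdaq.com/symbol/aapl">AAPL</a></td>'
nyse_line = '<td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td>'
print(detect_exchange(nasdaq_line))
print(detect_exchange(nyse_line))
print(detect_exchange("<td>Industrials</td>"))
```

Supporting another exchange then means adding one dictionary entry instead of another elif branch.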


"&amp;" and "&"

What you are trying to do when substituting "&amp;" with "&":

- deserves to be put in a function of its own
- actually corresponds to a common problem that is already solved: HTML entity decoding.
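Both points can be covered with the standard library's `html.unescape`, which decodes `&amp;` along with the other HTML character references. A minimal sketch, in which the helper name `clean_cell` is hypothetical:

```python
import html


def clean_cell(text):
    """Decode HTML entities in a table cell and trim surrounding whitespace."""
    return html.unescape(text).strip()


print(clean_cell("AT&amp;T Inc"))
print(clean_cell(" Johnson &amp; Johnson "))
```

Unlike a substitution that targets only `&amp;`, this also handles entities such as `&#39;` or `&quot;` if they ever appear in a cell.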

Code Snippets

```python
company_data = []

for i in range(len(table)):
    if bool(re.search(pattern = tick_ident_nasdaq, string = table[i])):
        exchange = "NASDAQ"
    elif bool(re.search(pattern = tick_ident_nyse, string = table[i])):
        exchange = "NYSE"
    else:
        exchange = None
    if exchange:
        ticker = re.search(pattern = name_grab, string = table[i]).group(1)
        name = re.search(pattern = name_grab, string = table[i + 1]).group(1)
        name = re.sub(pattern = "&amp;", repl = "&", string = name)
        cig = re.search(pattern = cigs_grab, string = table[i + 3]).group(1)
        cig = re.sub(pattern = "&amp;", repl = "&", string = cig)
        cig_sub = re.search(pattern = cigs_grab, string = table[i + 4]).group(1)
        cig_sub = re.sub(pattern = "&amp;", repl = "&", string = cig_sub)
        company_data.append((ticker, exchange, name, cig, cig_sub))
```

```python
# Define regex used for parsing
tick_ident_nasdaq = re.compile('href=\"http:\/\/www\.nasdaq\.com\/symbol\/')
tick_ident_nyse = re.compile('href=\"https:\/\/www.nyse.com\/quote\/')
name_grab = re.compile('\">(.+)<\/a></td>')
cigs_grab = re.compile('^<td>(.+)</td>')
amp_re = re.compile("&amp;")

company_data = []

for i in range(len(table)):
    if bool(tick_ident_nasdaq.search(string = table[i])):
        exchange = "NASDAQ"
    elif bool(tick_ident_nyse.search(string = table[i])):
        exchange = "NYSE"
    else:
        exchange = None
    if exchange:
        ticker = name_grab.search(string = table[i]).group(1)
        name = name_grab.search(string = table[i + 1]).group(1)
        name = amp_re.sub(repl = "&", string = name)
        cig = cigs_grab.search(string = table[i + 3]).group(1)
        cig = amp_re.sub(repl = "&", string = cig)
        cig_sub = cigs_grab.search(string = table[i + 4]).group(1)
        cig_sub = amp_re.sub(repl = "&", string = cig_sub)
        company_data.append((ticker, exchange, name, cig, cig_sub))
```

Context

StackExchange Code Review Q#156350, answer score: 11
