Python Web Scraping - Beautiful Soup
In this story I record what I have learned about web scraping. I want to apply it to scraping some Hong Kong news websites, because I used to work in advertising and that field is familiar to me. This tutorial scrapes timesjobs.com.
Here are the points I think are important.
Read HTML
The first step is to parse the HTML text with the 'lxml' parser.
soup = BeautifulSoup(html_text, 'lxml')
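The parsing step can be sketched as below. This is only a minimal illustration: the small html_text string is a made-up stand-in for the real page source (in practice it might come from something like requests.get(url).text, which is an assumption, since the step that downloads the page is not shown here). The fallback to Python's built-in 'html.parser' is also my addition, so the sketch still runs when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the downloaded page source.
html_text = "<html><body><h1>Jobs</h1><p>Hello</p></body></html>"

try:
    # Same call as in the tutorial: parse with the lxml parser.
    soup = BeautifulSoup(html_text, "lxml")
except FeatureNotFound:
    # Raised by Beautiful Soup when lxml is not installed;
    # fall back to the parser that ships with Python.
    soup = BeautifulSoup(html_text, "html.parser")

# Tags can be reached as attributes of the soup object.
heading = soup.h1.text
print(heading)
```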
Job posts
The timesjobs.com page lists several job posts. Let's inspect it.
Each job post sits in an <li> tag with a class. We use soup.find_all to scrape every job post.
jobs = soup.find_all('li', class_='clearfix job-bx wht-shd-bx')
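Here is a small sketch of how find_all behaves with that class string. The HTML below is invented for illustration and only imitates the structure described above. It uses 'html.parser' so that it runs without lxml. Note that passing a string with spaces to class_ matches the class attribute as an exact string, so I also show a CSS selector via soup.select, which does not depend on the order of the classes.

```python
from bs4 import BeautifulSoup

# Invented sample that imitates the list of job posts.
html_text = """
<ul>
  <li class="clearfix job-bx wht-shd-bx"><h2>Python Developer</h2></li>
  <li class="clearfix job-bx wht-shd-bx"><h2>Data Analyst</h2></li>
  <li class="other">Advertisement</li>
</ul>
"""
soup = BeautifulSoup(html_text, "html.parser")

# Matches <li> tags whose class attribute is exactly this string.
jobs = soup.find_all("li", class_="clearfix job-bx wht-shd-bx")

# CSS selector alternative: the <li> must carry all three classes, in any order.
jobs_css = soup.select("li.clearfix.job-bx.wht-shd-bx")

titles = [job.h2.text for job in jobs]
print(titles)
```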
Scraping company name
For each job in jobs, the company name is inside an <h3> tag with the class joblist-comp-name.
company_name = job.find('h3', class_ = 'joblist-comp-name').text.replace(' ', '')
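A quick sketch of what the clean-up does, using an invented job post and company name. replace(' ', '') removes every space, including the ones between words, and leaves line breaks in place. get_text(strip=True) is an alternative that only trims the surrounding whitespace. Which one you want depends on how the name should look in your output.

```python
from bs4 import BeautifulSoup

# Invented job post with the indentation typical of real page source.
html_text = """
<li class="clearfix job-bx wht-shd-bx">
  <h3 class="joblist-comp-name">
    Acme Trading Ltd
  </h3>
</li>
"""
job = BeautifulSoup(html_text, "html.parser").li

tag = job.find("h3", class_="joblist-comp-name")

# Approach from the tutorial: drops all spaces, keeps the newlines.
squeezed = tag.text.replace(" ", "")

# Alternative: trims leading and trailing whitespace only.
trimmed = tag.get_text(strip=True)

print(repr(squeezed))
print(repr(trimmed))
```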
Scraping more info
Inside the <h2> there is a link. The link leads to the job post's details page, so if we click the job post we see more details.
The <a> tag is inside the <h2>. href stands for Hypertext Reference, and it holds the URL of the link.
more_info = job.header.h2.a['href']
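The attribute lookup can be tried on a small invented job post like the one below. The example.com URL is a placeholder, not a real job link. The second half is an optional, more defensive variant I added: find returns None when no tag matches, and .get('href') returns None instead of raising KeyError when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Invented job post; the URL is a placeholder.
html_text = """
<li class="clearfix job-bx wht-shd-bx">
  <header>
    <h2><a href="https://example.com/job/123">Python Developer</a></h2>
  </header>
</li>
"""
job = BeautifulSoup(html_text, "html.parser").li

# Approach from the tutorial: walk header -> h2 -> a, then read href.
more_info = job.header.h2.a["href"]

# Defensive variant: tolerate a missing <a> tag or a missing href.
link = job.find("a")
safe_more_info = link.get("href") if link is not None else None

print(more_info)
```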