Uncertainty Online
This work was the first step in a larger research project aimed at reducing
the uncertainty that Internet users encounter when trying to judge the
validity and intent of information they find online. This particular
piece of the project focused on classifying webpages by type.
For the purposes of this project, we selected seven types of webpages
available online: news, scholarly, commercial, forums, personal, organization,
and blogs.
This project was a collaboration with Jen Mankoff and Haakon Faste.
Data Collection
The training and testing data for this project consisted of approximately
370 links. These links were collected using the Delicious API. I also used
the API to grab the tags associated with each link. Each link was then
processed using a number of Python scripts. One script broke the URL into
its component parts. Another scraped the page to capture page sections
like the title and body text. The last pulled features, such as the
frequency of scientific language and the ratio of first- to second-person
pronouns, from the page's HTML.
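To make two of these steps concrete, here is a minimal Python sketch of how a URL might be split into parts and how a pronoun ratio might be computed from page text. It is not the original project code: the function names split_url and pronoun_ratio, the pronoun word lists, and the choice to return None when no second-person pronouns appear are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical word lists; the original scripts may have used different ones.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}


def split_url(url):
    """Break a URL into component parts that could serve as features."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        # Drop empty segments produced by leading or trailing slashes.
        "path_segments": [s for s in parts.path.split("/") if s],
        "query": parts.query,
    }


def pronoun_ratio(text):
    """Return the ratio of first- to second-person pronouns in text.

    Returns None when there are no second-person pronouns, since the
    ratio is undefined in that case (an assumption of this sketch).
    """
    words = re.findall(r"[a-z']+", text.lower())
    first = sum(w in FIRST_PERSON for w in words)
    second = sum(w in SECOND_PERSON for w in words)
    return first / second if second else None
```

For example, split_url("http://blog.example.com/2010/05/post.html") yields a host of "blog.example.com" and three path segments, and pronoun_ratio would be applied to body text that has already been extracted from the page. Contractions such as "I'm" are not counted by this simple tokenizer.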