1. The "Harvesting" Phase

Filtering software companies, given their limited resources, do not attempt to index or classify all of the billions of pages that exist on the Web. Instead, the set of pages that they attempt to examine and classify is restricted to a small portion of the Web. The companies use a variety of automated and manual methods to identify a universe of Web sites and pages to "harvest" for classification. These methods include: entering certain key words into search engines; following links from a variety of online directories (e.g., generalized directories like Yahoo or various specialized directories, such as those that provide links to sexually explicit content); reviewing lists of newly-registered domain names; buying or licensing lists of URLs from third parties; "mining" access logs maintained by their customers; and reviewing other submissions from customers and the public. The goal of each of these methods is to identify as many URLs as possible that are likely to contain content that falls within the filtering companies' category definitions.

The first method, entering certain keywords into commercial search engines, suffers from several limitations. First, the Web pages that may be "harvested" through this method are limited to those pages that search engines have already identified.

However, as noted above, a substantial portion of the Web is not even theoretically indexable (because it is not linked to by any previously known page), and only approximately 50% of the pages that are theoretically indexable have actually been indexed by search engines. We are satisfied that the remainder of the indexable Web, and the vast "Deep Web," which cannot currently be indexed, includes materials that meet CIPA's categories of visual depictions that are obscene, child pornography, and harmful to minors. These portions of the Web cannot presently be harvested through the methods that filtering software companies use (except through reporting by customers or by observing users' log files), because they are not linked to other known pages. A user can, however, gain access to a Web site in the unindexed Web or the Deep Web if the Web site's proprietor or some other third party informs the user of the site's URL. Some Web sites, for example, send out mass email advertisements containing the site's URL, the spamming process we have described above.

Second, the search engines that software companies use for harvesting are able to search text only, not images. This is of critical importance, because CIPA, by its own terms, covers only "visual depictions." 20 U.S.C. Sec. 9134(f)(1)(A)(i); 47 U.S.C. Sec. 254(h)(5)(B)(i). Image recognition technology is immature, ineffective, and unlikely to improve substantially in the near future. None of the filtering software companies deposed in this case employs image recognition technology when harvesting or categorizing URLs. Due to the reliance on automated text analysis and the absence of image recognition technology, a Web page with sexually explicit images and no text cannot be harvested using a search engine. This problem is complicated by the fact that Web site publishers may use image files rather than text to represent words, i.e., they may use a file that computers understand to be a picture, like a photograph of a printed word, rather than regular text, making automated review of their textual content impossible. For example, if the Playboy Web site displays its name using a logo rather than regular text, a search engine would not see or recognize the Playboy name in that logo.

In addition to collecting URLs through search engines and Web directories (particularly those specializing in sexually explicit sites or other categories relevant to one of the filtering companies' category definitions), and by mining user logs and collecting URLs submitted by users, the filtering companies expand their list of harvested URLs by using "spidering" software that can "crawl" the lists of pages produced by the previous four methods, following their links downward to bring back the pages to which they link (and the pages to which those pages link, and so on, but usually down only a few levels).

This spidering software uses the same type of technology that commercial Web search engines use.

While useful in expanding the number of relevant URLs, the ability to retrieve additional pages through this approach is limited by the architectural feature of the Web that page-to-page links tend to converge rather than diverge. That means that the more pages from which one spiders downward through links, the smaller the proportion of new sites one will uncover; if spidering the links of 1000 sites retrieved through a search engine or Web directory turns up 500 additional distinct adult sites, spidering an additional 1000 sites may turn up, for example, only 250 additional distinct sites, and the proportion of new sites uncovered will continue to diminish as more pages are spidered.
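The diminishing return described above can be illustrated with a minimal sketch of a depth-limited crawl. Everything in it is hypothetical: the link graph, the page names, and the helper functions `crawl` and `new_sites_per_batch` are invented for illustration and do not represent any filtering company's actual spidering software.

```python
from collections import deque

def crawl(seeds, links, max_depth=2):
    """Breadth-first crawl: return every page reachable from `seeds`
    within `max_depth` link hops (the seeds themselves are excluded)."""
    seen = set(seeds)
    found = set()
    queue = deque((page, 0) for page in seeds)
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links below the depth limit
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                found.add(target)
                queue.append((target, depth + 1))
    return found

# Hypothetical link graph: eight seed pages all point into the same four sites,
# so their links converge rather than diverge.
LINKS = {f"seed{i}": [f"site{i % 4}", f"site{(i + 1) % 4}"] for i in range(8)}
LINKS["site0"] = ["site0/deep"]          # two hops from a seed
LINKS["site0/deep"] = ["site0/deeper"]   # three hops: beyond max_depth=2

def new_sites_per_batch(batches, links, max_depth=2):
    """Crawl seed batches in order; count pages not found by earlier batches."""
    known = set()
    counts = []
    for batch in batches:
        found = crawl(batch, links, max_depth)
        counts.append(len(found - known))
        known |= found
    return counts

batches = [["seed0", "seed1"], ["seed2", "seed3"], ["seed4", "seed5"]]
print(new_sites_per_batch(batches, LINKS))  # [4, 1, 0]
```

In this toy graph each additional batch of seeds turns up fewer previously unseen pages, and the page three hops down is never retrieved because the crawl stops after two levels.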

These limitations on the technology used to harvest a set of URLs for review will necessarily lead to substantial underblocking of material with respect to both the category definitions employed by filtering software companies and CIPA's definitions of visual depictions that are obscene, child pornography, or harmful to minors.

2. The "Winnowing" or Categorization Phase

Once the URLs have been harvested, some filtering software companies use automated key word analysis tools to evaluate the content and/or features of Web sites or pages accessed via a particular URL and to tentatively prioritize or categorize them.

This process may be characterized as "winnowing" the harvested URLs. Automated systems currently used by filtering software vendors to prioritize, and to categorize or tentatively categorize, the content and/or features of a Web site or page accessed via a particular URL operate by means of (1) simple key word searching, and (2) the use of statistical algorithms that rely on the frequency and structure of various linguistic features in a Web page's text. The automated systems used to categorize pages do not include image recognition technology.

All of the filtering companies deposed in the case also employ human review of some or all collected Web pages at some point during the process of categorizing Web pages. As with the harvesting process, each technique employed in the winnowing process is subject to limitations that can result in both overblocking and underblocking.

First, simple key-word-based filters are subject to the obvious limitation that no string of words can identify all sites that contain sexually explicit content, and most strings of words are likely to appear in Web sites that are not properly classified as containing sexually explicit content. As noted above, filtering software companies also use more sophisticated automated classification systems for the statistical classification of texts. These systems assign weights to words or other textual features and use algorithms to determine whether a text belongs to a certain category. These algorithms sometimes make reference to the position of a word within a text or its relative proximity to other words. The weights are usually determined by machine learning methods (often described as "artificial intelligence"). In this procedure, which resembles an automated form of trial and error, a system is given a "training set" consisting of documents preclassified into two or more groups, along with a set of features that might be potentially useful in classifying the sets. The system then "learns" rules that assign weights to those features according to how well they work in classification, and assigns each new document to a category with a certain probability.
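As a rough illustration of the training procedure just described, the sketch below learns one weight per word from a tiny preclassified training set and then assigns a probability to new text. It uses multinomial naive Bayes with add-one smoothing, which is an assumption chosen for brevity; the record does not say which algorithms the filtering companies actually use, and the training documents are invented.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Learn a log-odds weight per word from (text, is_explicit) pairs,
    using add-one smoothing (multinomial naive Bayes)."""
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter()
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    weights = {
        w: math.log((counts[True][w] + 1) / (totals[True] + len(vocab)))
        - math.log((counts[False][w] + 1) / (totals[False] + len(vocab)))
        for w in vocab
    }
    bias = math.log(n_docs[True] / n_docs[False])  # prior odds of the category
    return weights, bias

def probability(text, weights, bias):
    """Probability that `text` belongs to the 'explicit' category."""
    score = bias + sum(weights.get(w, 0.0) for w in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical training set, preclassified by hand into two groups.
TRAINING = [
    ("free explicit pictures", True),
    ("explicit adult pictures", True),
    ("breast cancer screening", False),
    ("adult education classes", False),
]
weights, bias = train(TRAINING)
print(probability("explicit pictures", weights, bias))
print(probability("cancer screening", weights, bias))
print(probability("breast pictures", weights, bias))
```

With this training set, the third query ("breast pictures") scores above one half even though it could describe a medical page, because the system weighs words rather than understanding content.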

Notwithstanding their "artificial intelligence" description, automated text classification systems are unable to grasp many distinctions between types of content that would be obvious to a human. And of critical importance, no presently conceivable technology can make the judgments necessary to determine whether a visual depiction fits the legal definitions of obscenity, child pornography, or harmful to minors.

Finally, all the filtering software companies deposed in this case use some form of human review in their process of winnowing and categorizing Web pages, although one company admitted to categorizing some Web pages without any human review.

SmartFilter states that "the final categorization of every Web site is done by a human reviewer." Another filtering company asserts that of the 10,000 to 30,000 Web pages that enter the "work queue" to be categorized each day, two to three percent of those are automatically categorized by their PornByRef system (which only applies to materials classified in the pornography category), and the remainder are categorized by human review.

SurfControl also states that no URL is ever added to its database without human review.

Human review of Web pages has the advantage of allowing more nuanced, if not more accurate, interpretations than automated classification systems are capable of making, but suffers from its own sources of error. The filtering software companies involved here have limited staff, of between eight and a few dozen people, available for hand reviewing Web pages. The reviewers that are employed by these companies base their categorization decisions on both the text and the visual depictions that appear on the sites or pages they are assigned to review. Human reviewers generally focus on English language Web sites, and are generally not required to be multi-lingual.

Given the speed at which human reviewers must work to keep up with even a fraction of the approximately 1.5 million pages added to the publicly indexable Web each day, human error is inevitable. Errors are likely to result from boredom or lack of attentiveness, overzealousness, or a desire to "err on the side of caution" by screening out material that might be offensive to some customers, even if it does not fit within any of the company's category definitions. None of the filtering companies trains its reviewers in the legal definitions concerning what is obscene, child pornography, or harmful to minors, and none instructs reviewers to take community standards into account when making categorization decisions.

Perhaps because of limitations on the number of human reviewers and because of the large number of new pages that are added to the Web every day, filtering companies also widely engage in the practice of categorizing entire Web sites at the "root URL," rather than engaging in a more fine-grained analysis of the individual pages within a Web site. For example, the filtering software companies deposed in this case all categorize the entire Playboy Web site as Adult, Sexually Explicit, or Pornography. They do not differentiate between pages within the site containing sexually explicit images or text and, for example, pages containing no sexually explicit content, such as the text of interviews of celebrities or politicians. If the "root" or "top-level" URL of a Web site is given a category tag, then access to all content on that Web site will be blocked if the assigned category is enabled by a customer.

In some cases, whole Web sites are blocked because the filtering companies focus only on the content of the home page that is accessed by entering the root URL. Entire Web sites containing multiple Web pages are commonly categorized without human review of each individual page on that site. Web sites that may contain multiple Web pages and that require authentication or payment for access are commonly categorized based solely on a human reviewer's evaluation of the pages that may be viewed prior to reaching the authentication or payment page.

Because there may be hundreds or thousands of pages under a root URL, filtering companies make it their primary mission to categorize the root URL, and categorize subsidiary pages if the need arises or if there is time. This form of overblocking is called "inheritance," because lower-level pages inherit the categorization of the root URL without regard to their specific content. In some cases, "reverse inheritance" also occurs, i.e., parent sites inherit the classification of pages in a lower level of the site. This might happen when pages with sexual content appear in a Web site that is devoted primarily to non-sexual content. For example, N2H2's Bess filtering product classifies every page in the Salon.com Web site, which contains a wide range of news and cultural commentary, as "Sex, Profanity," based on the fact that the site includes a regular column that deals with sexual issues.
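A minimal sketch of how "inheritance" follows from a control list keyed by root URL is given below. The host names, category labels, and the functions `category_for` and `is_blocked` are hypothetical and are not taken from any vendor's product.

```python
from urllib.parse import urlsplit

# Hypothetical control list: one category tag per host ("root URL").
CONTROL_LIST = {
    "magazine.example": "Adult",
    "casino.example": "Gambling",
}

def category_for(url, control_list):
    """Look up a URL by its host only, so every page on the site
    inherits whatever tag was assigned to the root URL."""
    host = urlsplit(url).hostname
    return control_list.get(host)

def is_blocked(url, control_list, enabled_categories):
    """A page is blocked when its inherited category is enabled by the customer."""
    return category_for(url, control_list) in enabled_categories

enabled = {"Adult"}
print(is_blocked("http://magazine.example/interviews/senator.html", CONTROL_LIST, enabled))
print(is_blocked("http://news.example/politics/", CONTROL_LIST, enabled))
```

Because the lookup never consults the path, an interview page with no sexually explicit content is treated exactly like the site's home page.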

Blocking by both domain name and IP address is another practice in which filtering companies engage that is a function both of the architecture of the Web and of the exigencies of dealing with the rapidly expanding number of Web pages. The category lists maintained by filtering software companies can include URLs in either their human-readable domain name address form, their numeric IP address form, or both. Through "virtual hosting" services, hundreds of thousands of Web sites with distinct domain names may share a single numeric IP address. To the extent that filtering companies block the IP addresses of virtual hosting services, they will necessarily block a substantial amount of content without reviewing it, and will likely overblock a substantial amount of content.
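The effect of blocking a shared IP address can be sketched as follows. The domain names and the hosting table are invented, and the addresses come from the 192.0.2.0/24 range reserved for documentation; this is an illustration under those assumptions, not a description of any vendor's control list.

```python
# Hypothetical virtual-hosting table: several unrelated domains share one address.
HOSTING = {
    "adult-site.example": "192.0.2.10",
    "church-choir.example": "192.0.2.10",
    "bike-repair.example": "192.0.2.10",
    "city-news.example": "192.0.2.20",
}

def domains_blocked_by_ip(hosting, blocked_ips):
    """Return every domain that becomes unreachable when `blocked_ips` are blocked."""
    return sorted(d for d, ip in hosting.items() if ip in blocked_ips)

# Blocking the address of one reviewed site also blocks its unreviewed neighbors.
print(domains_blocked_by_ip(HOSTING, {"192.0.2.10"}))
```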

Another technique that filtering companies use in order to deal with a structural feature of the Internet is blocking the root level URLs of so-called "loophole" Web sites. These are Web sites that provide access to a particular Web page, but display in the user's browser a URL that is different from the URL with which the particular page is usually associated. Because of this feature, they provide a "loophole" that can be used to get around filtering software, i.e., they display a URL that is different from the one that appears on the filtering company's control list. "Loophole" Web sites include caches of Web pages that have been removed from their original location, "anonymizer" sites, and translation sites.

Caches are archived copies that some search engines, such as Google, keep of the Web pages they index. The cached copy stored by Google will have a URL that is different from the original URL. Because Web sites often change rapidly, caches are the only way to access pages that have been taken down, revised, or have changed their URLs for some reason. For example, a magazine might place its current stories under a given URL, and replace them monthly with new stories. If a user wanted to find an article published six months ago, he or she would be unable to access it if not for Google's cached version.

Some sites on the Web serve as a proxy or intermediary between a user and another Web page. When using a proxy server, a user does not access the page from its original URL, but rather from the URL of the proxy server. One type of proxy service is an "anonymizer." Users may access Web sites indirectly via an anonymizer when they do not want the Web site they are visiting to be able to determine the IP address from which they are accessing the site, or to leave "cookies" on their browser.

Some proxy servers can be used to attempt to translate Web page content from one language to another. Rather than directly accessing the original Web page in its original language, users can instead indirectly access the page via a proxy server offering translation features.

As noted above, filtering companies often block loophole sites, such as caches, anonymizers, and translation sites. The practice of blocking loophole sites necessarily results in a significant amount of overblocking, because the vast majority of the pages that are cached, for example, do not contain content that would match a filtering company's category definitions.

Filters that do not block these loophole sites, however, may enable users to access any URL on the Web via the loophole site, thus resulting in substantial underblocking.

3. The Process for "Re-Reviewing" Web Pages After Their Initial Categorization

Most filtering software companies do not engage in subsequent reviews of categorized sites or pages on a scheduled basis. Priority is placed on reviewing and categorizing new sites and pages, rather than on re-reviewing already categorized sites and pages. Typically, a filtering software vendor's previous categorization of a Web site is not re-reviewed for accuracy when new pages are added to the Web site. To the extent the Web site was previously categorized as a whole, the new pages added to the site usually share the categorization assigned by the blocking product vendor. This necessarily results in both over- and underblocking, because, as noted above, the content of Web pages and Web sites changes relatively rapidly.

In addition to the content on Web sites or pages changing rapidly, Web sites themselves may disappear and be replaced by sites with entirely different content. If an IP address associated with a particular Web site is blocked under a particular category and the Web site goes out of existence, then the IP address likely would be reassigned to a different Web site, either by an Internet service provider or by a registration organization, such as the American Registry for Internet Numbers, see http://www.arin.net. In that case, the site that received the reassigned IP address would likely be miscategorized.

Because filtering companies do not engage in systematic re-review of their category lists, such a site would likely remain miscategorized unless someone submitted it to the filtering company for re-review, increasing the incidence of over- and underblocking.

This failure to re-review Web pages primarily increases a filtering company's rate of overblocking. However, if a filtering company does not re-review Web pages after it determines that they do not fall into any of its blocking categories, then that would result in underblocking (because, for example, a page might add sexually explicit content).

3. The Inherent Tradeoff Between Overblocking and Underblocking

There is an inherent tradeoff between any filter's rate of overblocking (which information scientists also call "precision") and its rate of underblocking (which is also referred to as "recall"). The rate of overblocking or precision is measured by the proportion of the things a classification system assigns to a certain category that are appropriately classified. The plaintiffs' expert, Dr. Nunberg, provided the hypothetical example of a classification system that is asked to pick out pictures of dogs from a database consisting of 1000 pictures of animals, of which 80 were actually dogs. If it returned 100 hits, of which 80 were in fact pictures of dogs, and the remaining 20 were pictures of cats, horses, and deer, we would say that the system identified dog pictures with a precision of 80%. This would be analogous to a filter that overblocked at a rate of 20%.

The recall measure involves determining what proportion of the actual members of a category the classification system has been able to identify. For example, if the hypothetical animal-picture database contained a total of 200 pictures of dogs, and the system identified 80 of them and failed to identify 120, it would have performed with a recall of 40%. This would be analogous to a filter that underblocked 60% of the material in a category.
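The arithmetic behind the two measures can be checked with a short sketch using the figures from Dr. Nunberg's hypothetical. The function names are illustrative only.

```python
def precision(true_positives, false_positives):
    """Share of the items the system flagged that really belong to the category."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Share of the category's actual members that the system found."""
    return true_positives / (true_positives + false_negatives)

# 100 hits, of which 80 are dogs and 20 are other animals.
p = precision(true_positives=80, false_positives=20)
# 200 dogs in the database, of which 80 were found and 120 were missed.
r = recall(true_positives=80, false_negatives=120)

print(f"precision {p:.0%}, analogous overblocking rate {1 - p:.0%}")
print(f"recall {r:.0%}, analogous underblocking rate {1 - r:.0%}")
```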

In automated classification systems, there is always a tradeoff between precision and recall. In the animal-picture example, the recall could be improved by using a looser set of criteria to identify the dog pictures in the set, such as any animal with four legs, and all the dogs would be identified, but cats and other animals would also be included, with a resulting loss of precision. The same tradeoff exists between rates of overblocking and underblocking in filtering systems that use automated classification systems. For example, an automated system that classifies any Web page that contains the word "sex" as sexually explicit will underblock much less, but overblock much more, than a system that classifies any Web page containing the phrase "free pictures of people having sex" as sexually explicit.
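The contrast between the two key word rules can be made concrete with a toy example. The four pages, their hand-assigned labels, and the use of simple case-insensitive substring matching are all assumptions made for illustration; real filters may match key words differently.

```python
# Hypothetical pages and a hand-assigned set of those that are actually explicit.
PAGES = {
    "explicit-a": "Free pictures of people having sex",
    "explicit-b": "Adult videos: sex scenes",
    "health": "Sex education and health resources for teens",
    "county": "Middlesex County public library hours",
}
EXPLICIT = {"explicit-a", "explicit-b"}

def blocked_by(term, pages):
    """Block every page whose text contains `term` (case-insensitive substring)."""
    return {name for name, text in pages.items() if term.lower() in text.lower()}

def errors(blocked, explicit):
    """Return (number overblocked, number underblocked)."""
    return len(blocked - explicit), len(explicit - blocked)

for rule in ("sex", "free pictures of people having sex"):
    over, under = errors(blocked_by(rule, PAGES), EXPLICIT)
    print(f"{rule!r}: overblocked={over}, underblocked={under}")
```

On this data the single word blocks both explicit pages but also the health and county pages, while the long phrase blocks nothing innocent and misses one of the two explicit pages.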

This tradeoff between overblocking and underblocking also applies not just to automated classification systems, but also to filters that use only human review. Given the approximately two billion pages that exist on the Web, the 1.5 million new pages that are added daily, and the rate at which content on existing pages changes, if a filtering company blocks only those Web pages that have been reviewed by humans, it will be impossible, as a practical matter, to avoid vast amounts of underblocking.

Techniques used by human reviewers such as blocking at the IP address level, domain name level, or directory level reduce the rates of underblocking, but necessarily increase the rates of overblocking, as discussed above.

To use a simple example, it would be easy to design a filter intended to block sexually explicit speech that completely avoids overblocking. Such a filter would have only a single sexually explicit Web site on its control list, which could be re-reviewed daily to ensure that its content does not change. While there would be no overblocking problem with such a filter, such a filter would have a severe underblocking problem, as it would fail to block all the sexually explicit speech on the Web other than the one site on its control list. Similarly, it would also be easy to design a filter intended to block sexually explicit speech that completely avoids underblocking. Such a filter would operate by permitting users to view only a single Web site, e.g., the Sesame Street Web site. While there would be no underblocking problem with such a filter, it would have a severe overblocking problem, as it would block access to millions of non-sexually explicit sites on the Web other than the Sesame Street site.

While it is thus quite simple to design a filter that does not overblock, and equally simple to design a filter that does not underblock, it is currently impossible, given the Internet's size, rate of growth, rate of change, and architecture, and given the state of the art of automated classification systems, to develop a filter that neither underblocks nor overblocks a substantial amount of speech. The more effective a filter is at blocking Web sites in a given category, the more the filter will necessarily overblock. Any filter that is reasonably effective in preventing users from accessing sexually explicit content on the Web will necessarily block substantial amounts of non-sexually explicit speech.

4. Attempts to Quantify Filtering Programs' Rates of Over- and Underblocking

The government presented three studies, two from expert witnesses, and one from a librarian fact witness who conducted a study using Internet use logs from his own library, that attempt to quantify the over- and underblocking rates of five different filtering programs. The plaintiffs presented one expert witness who attempted to quantify the rates of over- and underblocking for various programs. Each of these attempts to quantify rates of over- and underblocking suffers from various methodological flaws.

The fundamental problem with calculating over- and underblocking rates is selecting a universe of Web sites or Web pages to serve as the set to be tested. The studies that the parties submitted in this case took two different approaches to this problem. Two of the studies, one prepared by the plaintiffs' expert witness Chris Hunter, a graduate student at the University of Pennsylvania, and the other prepared by the defendants' expert, Chris Lemmons of eTesting Laboratories, in Research Triangle Park, North Carolina, approached this problem by compiling two separate lists of Web sites, one of URLs that they deemed should be blocked according to the filters' criteria, and another of URLs that they deemed should not be blocked according to the filters' criteria. They compiled these lists by choosing Web sites from the results of certain key word searches. The problem with this selection method is that it is neither random, nor does it necessarily approximate the universe of Web pages that library patrons visit.

The two other studies, one by David Biek, head librarian at the Tacoma Public Library's main branch, and one by Cory Finnell of Certus Consulting Group, of Seattle, Washington, chose actual logs of Web pages visited by library patrons during specific time periods as the universe of Web pages to analyze. This method, while surely not as accurate as a truly random sample of the indexed Web would be (assuming it would be possible to take such a sample), has the virtue of using the actual Web sites that library patrons visited during a specific period. Because library patrons selected the universe of Web sites that Biek and Finnell's studies analyzed, this removes the possibility of bias resulting from the study author's selection of the universe of sites to be reviewed. We find that the Lemmons and Hunter studies are of little probative value because of the methodology used to select the sample universe of Web sites to be tested. We will therefore focus on the studies conducted by Finnell and Biek in trying to ascertain estimates of the rates of over- and underblocking that takes place when filters are used in public libraries.

The government hired expert witness Cory Finnell to study the Internet logs compiled by the public library systems in Tacoma, Washington; Westerville, Ohio; and Greenville, South Carolina. Each of these libraries uses filtering software that keeps a log of information about individual Web site requests made by library patrons. Finnell, whose consulting firm specializes in data analysis, has substantial experience evaluating Internet access logs generated on networked systems. He spent more than a year developing a reporting tool for N2H2, and, in the course of that work, acquired a familiarity with the design and operation of Internet filtering products.

The Tacoma library uses Cyber Patrol filtering software, and logs information only on sites that were blocked. Finnell worked from a list of all sites that were blocked in the Tacoma public library in the month of August 2001. The Westerville library uses the Websense filtering product, and logs information on both blocked sites and non-blocked sites. When the logs reach a certain size, they are overwritten by new usage logs. Because of this overwriting feature, logs were available to Finnell only for the relatively short period from October 1, 2001 to October 3, 2001. The Greenville library uses N2H2's filtering product and logs both blocked sites and sites that patrons accessed. The logs contain more than 500,000 records per day. Because of the volume of the records, Finnell restricted his analysis to the period from August 2, 2001 to August 15, 2001.

Finnell calculated an overblocking rate for each of the three libraries by examining the host Web site containing each of the blocked pages. He did not employ a sampling technique, but instead examined each blocked Web site. If the contents of a host Web site or the pages within the Web site were consistent with the filtering product's definition of the category under which the site was blocked, Finnell considered it to be an accurate block. Finnell and three others, two of whom were temporary employees, examined the Web sites to determine whether they were consistent with the filtering companies' category definitions. Their review was, of course, necessarily limited by: (1) the clarity of the filtering companies' category definitions; (2) Finnell's and his employees' interpretations of the definitions; and (3) human error. The study's reliability is also undercut by the fact that Finnell failed to archive the blocked Web pages as they existed either at the point that a patron in one of the three libraries was denied access or when Finnell and his team reviewed the pages. It is therefore impossible for anyone to check the accuracy and consistency of Finnell's review team, or to know whether the pages contained the same content when the block occurred as they did when Finnell's team reviewed them. This is a key flaw, because the results of the study depend on individual determinations as to overblocking and underblocking, in which Finnell and his team were required to compare what they saw on the Web pages that they reviewed with standard definitions provided by the filtering company.

Tacoma library's Cyber Patrol software blocked 836 unique Web sites during the month of August. Finnell determined that 783 of those blocks were accurate and that 53 were inaccurate.

The error rate for Cyber Patrol was therefore estimated to be 6.34%, and the true error rate was estimated with 95% confidence to lie within the range of 4.69% to 7.99%. Finnell and his team reviewed 185 unique Web sites that were blocked by Westerville Library's Websense filter during the logged period and determined that 158 of them were accurate and that 27 of them were inaccurate. He therefore estimated the Websense filter's overblocking rate at 14.59% with a 95% confidence interval of 9.51% to 19.68%. Additionally, Finnell examined 1,674 unique Web sites that were blocked by the Greenville Library's N2H2 filter during the relevant period and determined that 1,520 were accurate and that 87 were inaccurate. This yields an estimated overblocking rate of 5.41% and a 95% confidence interval of 4.33% to 6.55%.

Finnell's methodology was materially flawed in that it understates the rate of overblocking for the following reasons.

First, patrons from the three libraries knew that the filters were operating, and may have been deterred from attempting to access Web sites that they perceived to be "borderline" sites, i.e., those that may or may not have been appropriately filtered according to the filtering companies' category definitions.

Second, in their cross-examination of Finnell, the plaintiffs offered screen shots of a number of Web sites that, according to Finnell, had been appropriately blocked, but that Finnell admitted contained only benign materials. Finnell's explanation was that the Web sites must have changed between the time when he conducted the study and the time of the trial, but because he did not archive the images as they existed when his team reviewed them for the study, there is no way to verify this. Third, because of the way in which Finnell counted blocked Web sites (i.e., if separate patrons attempted to reach the same Web site, or one or more patrons attempted to access more than one page on a single Web site, Finnell counted these attempts as a single block, see supra note 10), his results necessarily understate the number of times that patrons were erroneously denied access to information.

At all events, there is no doubt that Finnell's estimated rates of overblocking, which are based on the filtering companies' own category definitions, significantly understate the rate of overblocking with respect to CIPA's category definitions for filtering for adults. The filters used in the Tacoma, Westerville, and Greenville libraries were configured to block, among other things, images of full nudity and sexually explicit materials. There is no dispute, however, that these categories are far broader than CIPA's categories of visual depictions that are obscene, or child pornography, the two categories of material that libraries subject to CIPA must certify that they filter during adults' use of the Internet.

Finnell's study also calculated underblocking rates with respect to the Westerville and Greenville Libraries (both of which logged not only their blocked sites, but all sites visited by their patrons), by taking random samples of URLs from the list of sites that were not blocked. The study used a sample of 159 sites that were accessed by Westerville patrons and determined that only one of them should have been blocked under the software's category definitions, yielding an underblocking rate of 0.6%. Given the size of the sample, the 95% confidence interval is 0% to 1.86%. The study examined a sample of 254 Web sites accessed by patrons in Greenville and found that three of them should have been blocked under the filtering software's category definitions. This results in an estimated underblocking rate of 1.2% with a 95% confidence interval ranging from 0% to 2.51%.
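The reported rates and intervals can be approximately reconstructed with a short sketch. It assumes the intervals are normal-approximation (Wald) intervals, p ± 1.96·sqrt(p(1 − p)/n), truncated at zero; the text does not say how Finnell computed them, so this is an inference from the numbers rather than a description of his method. Under that assumption the Cyber Patrol and Websense overblocking intervals and both underblocking intervals are reproduced to two decimal places, while the N2H2 interval comes out within about 0.03 percentage points of the reported 4.33% to 6.55%. The 5.41% N2H2 rate corresponds to 87 of the 1,607 sites classified (1,520 plus 87), not of the 1,674 sites examined.

```python
import math

def wald_interval(errors, n, z=1.96):
    """Point estimate and normal-approximation 95% interval for a proportion,
    clipped to the range [0, 1]."""
    p = errors / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# (label, misclassified sites, sites classified), taken from the figures above.
STUDIES = [
    ("Cyber Patrol overblocking (Tacoma)", 53, 836),
    ("Websense overblocking (Westerville)", 27, 185),
    ("N2H2 overblocking (Greenville)", 87, 1520 + 87),
    ("Websense underblocking (Westerville)", 1, 159),
    ("N2H2 underblocking (Greenville)", 3, 254),
]

for label, errors, n in STUDIES:
    p, low, high = wald_interval(errors, n)
    print(f"{label}: {p:.2%} (95% CI {low:.2%} to {high:.2%})")
```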
