Saturday 3 February 2018

How to manage Internet Content Blocking: some practicalities






“The Internet contains some deeply troublesome and harmful material. The main commercial players are both immensely rich and immensely clever – they must be able to do more to find solutions. If they don’t, we politicians will prosecute/fine/tax them until they behave responsibly.” So goes the refrain, but what can we reasonably expect of the available technologies? Here is a guide for campaigners.

Issue 1: what criteria are you applying for blocking “undesirable” material?
To those who haven’t thought about the issue it seems obvious what needs to be blocked. Almost anyone other than the most extreme libertarian will point to material which they find distressing or harmful – and be able to produce justifying arguments. But if you are asking a computer program or a human being to make decisions, there has to be greater clarity. In almost all circumstances there will be countervailing arguments about freedom of speech, freedom of expression and censorship.
The easiest policy to implement is where one can point to existing legislation defining specifically illegal content. For example, in the United Kingdom possession of indecent images of children is a strict liability offence[i]. Published guidelines from the Sentencing Council describe in detail three levels of offence in terms of age and specific activities[ii]. Similarly “extreme pornography” is clearly defined – essentially animals, dead people and absence of consent[iii]. But outside that particular context there is no definition of “extremism”, still less of “harmful”.[iv] Successive would-be legislators have struggled because so often how a particular document or file comes across depends not only on its content but on its context.
A simple example: let’s take two statements: “the state of Israel is a theft from Palestinians” and “the state of Israel is entitled to occupy all the territories mentioned in the Bible”. Are these statements, which many people would label “extreme”, simply expressions of history and religious belief? Do we have a different view of them if they are accompanied by a call to action – push all the Jews out, push out all the Arabs? The boundaries are unclear, and it seems unreasonable that, if legislators are unwilling to provide assistance, Internet companies should somehow be forced to make those decisions. There is a separate further issue for the biggest of the global companies in that judgements about extremism and harmfulness vary across jurisdictions and cultures.
It gets more difficult with “grooming”, whether for a sexual purpose or to incite terrorist acts. The whole point of grooming is that it starts low key and then builds. It is easy enough to identify grooming after a successful exercise[v], but how do you distinguish the early stages from ordinary conversation? And how do you do so via a computer program or a human monitor?
Finally, it is even more difficult to think what the evidence would look like where the enforceable law simply says: social media sites should keep children safe.


Issue 2: what is the legal framework within which material gets uploaded?
Material gets uploaded to the Internet via a variety of legal frameworks and this has an impact on where potential legal enforcement can be directed. An individual might buy web space from an Internet service provider and create their own website. That same individual may provide facilities for third parties to post comments which are then instantly and automatically visible to all visitors. A social media service will almost certainly require a specific sign-up from their subscribers/members and at that time inform them of an “acceptable use” or “community standards” policy, but will thereafter allow postings without prior approval or initial restraint.
The position currently taken by most Internet service companies, bolstered by various directives and laws, is that they are not publishers in the same sense as traditional media such as newspapers, magazines and broadcast television stations. They say that they are providing facilities but are not editors. Or that they are “data processors” as opposed to “data controllers”[vi]. The claim is that they are “intermediaries” for the purpose of the E-Commerce Directive and Regulations. These arguments are currently being hotly debated. But even under their interpretation there is a significant impact on what one can reasonably expect them to do in terms of attempting to block before publication.
The main business of Google is to index World Wide Web content which has been originated by others with whom it has no contractual relationship. It has a series of “crawler” programs which scavenge the open part of the World Wide Web; the findings are then indexed and that is what visitors to Google’s main pages see. The contractual relationship that is most important in the basic Google framework is with those who use the indexes – essentially the service is paid for by allowing Google to harvest information about individuals which can be turned into targeted advertising. But Google is not under any compulsion or contractual obligation to index anything; it can block at will. Its main policy reason for refusing to block is that it has decided it favours completeness and freedom of speech and expression; it blocks only when there is an overwhelming reason to do so.
By contrast, for Facebook, Twitter and many similar services the contractual relationship is with their customers/subscribers/members. It consists of saying “we will let you see what others have posted and we will let you post, provided you allow us to harvest information about you and send you targeted advertisements”. As part of the contract there is usually an Acceptable Use or Community Standards provision which is the basis for blocking. But here again, as companies headquartered in the United States, they are concerned about observing First Amendment rights[vii].
There are important differences in terms of what one can expect if some of this material is to be blocked. In the case of Google they have no opportunity to prevent material from being uploaded; the earliest point at which they could intervene is when their crawler comes across material which has already been published. Their choice is to refuse to index. But for the social media sites, where the acceptable use policy is part of the customer agreement, the earliest opportunity for blocking is when the customer uploads material.

Issue 3: technical means for blocking material (a) that has already been identified as “undesirable”.
We must now look at the various blocking technologies and see how far they are practical to implement. There is a significant difference between situations where material has already been identified by some method or other as requiring blocking and material which no one has so far seen and passed judgement on.
Blocking of known “undesirable” material (I am using the word “undesirable” to avoid the problems raised in Issue 1 above) is relatively straightforward, though there are questions of how to do so at the speed and quantity of uploads. For example on Facebook, it is said that every 60 seconds 510,000 comments are posted, 293,000 statuses are updated and 136,000 photos uploaded[viii].

It is trivially easy to block an entire website. The block is on the URL – www.nastysite.com – and this is the method traditionally used by such bodies as the Internet Watch Foundation and the National Center for Missing and Exploited Children. It is also possible, again by URL, to block part of the website – www.harmlesssite.com/nastymaterial – though here the blocking will fail if the folder containing the undesirable material is given a different name or location in the file structure of the website as a whole. One can extend this method to specific pages and pictures on the website – www.harmlesssite.com/harmless/nastyfile.jpg – but here too simple name changes will render the blocking ineffective.
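For the technically curious, here is a minimal sketch in Python of what URL-based blocking looks like in practice. The domain names, blocklist entries and function names are invented purely for illustration; real blocklists are far larger and updated constantly.

```python
from urllib.parse import urlparse

# Illustrative blocklist entries (hypothetical): a whole domain, a folder
# within an otherwise harmless site, and a single file.
BLOCKED_DOMAINS = {"www.nastysite.com"}
BLOCKED_PREFIXES = {"www.harmlesssite.com/nastymaterial"}
BLOCKED_URLS = {"www.harmlesssite.com/harmless/nastyfile.jpg"}

def is_blocked(url: str) -> bool:
    """Return True if the URL matches the blocklist by domain, folder or exact path."""
    parsed = urlparse(url if "//" in url else "//" + url)
    full = parsed.netloc.lower() + parsed.path
    if parsed.netloc.lower() in BLOCKED_DOMAINS:
        return True
    if any(full.startswith(prefix) for prefix in BLOCKED_PREFIXES):
        return True
    return full in BLOCKED_URLS

print(is_blocked("http://www.nastysite.com/anything"))                  # True
print(is_blocked("http://www.harmlesssite.com/nastymaterial/pic.jpg"))  # True
print(is_blocked("http://www.harmlesssite.com/renamed/pic.jpg"))        # False - folder renamed, block evaded
```

The last line shows the weakness described above: rename the folder or file and the narrower blocklist entries no longer match.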
Blocking on the basis of keyword is impossibly crude. “Sex” eliminates the counties of Sussex, Essex, Middlesex etc. as well as much useful material on health, education, law enforcement and more.
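A few lines of Python, again with invented examples, show why naive keyword matching over-blocks:

```python
# Naive keyword filtering: a substring match on "sex" wrongly flags
# place names as well as legitimate health and educational material.
BLOCKED_KEYWORDS = {"sex"}

def keyword_blocked(text: str) -> bool:
    lower = text.lower()
    return any(word in lower for word in BLOCKED_KEYWORDS)

print(keyword_blocked("Planning guidance for West Sussex and Middlesex"))  # True - a false positive
print(keyword_blocked("Sexual health advice for teenagers"))               # True - arguably useful material
```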
In order to overcome these problems one must turn to a different technology – file hashing. A file hash or fingerprint is created using a simple program[ix] which is applied to the totality of a file – photo, document, software program – to produce a unique short sequence of numbers and letters. The program is clever enough that for most purposes no two dissimilar files will ever produce the same hash or signature. A database of these hashes is built up and when a file is presented for examination a hash is created and compared with the database. If there is a match the newly uploaded file is then blocked. File hashing is used elsewhere throughout computing in order, for example, to demonstrate whether or not a file has been altered.
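As a sketch of the mechanics: Python’s standard hashlib library can produce such a fingerprint in a few lines. The hash value in the “database” below is an arbitrary placeholder rather than a real entry.

```python
import hashlib

def file_hash(path: str) -> str:
    """Compute the SHA-256 fingerprint of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical database of hashes of files already judged "undesirable"
# (the value below is a made-up placeholder).
KNOWN_UNDESIRABLE_HASHES = {
    "0c7e6a405862e402eb76a70f8a26fc732d07c32931e9fae9ab1582911d2e8a3b",
}

def should_block(path: str) -> bool:
    """Block an uploaded file only if its hash exactly matches the database."""
    return file_hash(path) in KNOWN_UNDESIRABLE_HASHES
```

Because changing even a single byte of a file produces a completely different hash, this kind of matching only ever catches exact copies – which is precisely the limitation described next.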
This method only works to identify absolutely identical files, so that if an “undesirable” file has been slightly altered there will be a different hash and blocking will not take place. To a limited extent there is also a further technology which deals with slightly dissimilar files. For photo images the most popular of these is called PhotoDNA[x], which is promoted by Microsoft and given away to Internet service providers, social media services and law enforcement. There are two typical situations where it is effective – when a file has been subject to a degree of compression to reduce its size and where there are a series of adjacent clips taken from a video.
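PhotoDNA itself is proprietary and only released to vetted organisations, but the general idea – a “perceptual” hash that changes little when an image is recompressed or resized – can be illustrated with the open-source imagehash library. The matching threshold below is an invented figure that would have to be tuned in practice.

```python
# Not PhotoDNA itself; the open-source "imagehash" library is used here
# as an analogous illustration of perceptual hashing.
# pip install pillow imagehash
from PIL import Image
import imagehash

def perceptual_distance(path_a: str, path_b: str) -> int:
    """Hamming distance between the perceptual hashes of two images."""
    return imagehash.phash(Image.open(path_a)) - imagehash.phash(Image.open(path_b))

# Hypothetical threshold: small distances suggest the same underlying image
# even after recompression or resizing; the exact cut-off must be tuned.
MATCH_THRESHOLD = 8

def looks_like_known_image(candidate: str, known: str) -> bool:
    return perceptual_distance(candidate, known) <= MATCH_THRESHOLD
```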

Issue 4: technical means for blocking material (b) that is new and hasn’t been seen before.
This leaves the situation where wholly new material, never seen before, is uploaded, or where previously seen material has been substantially altered, for example by cropping or selection. Here many claims are made for “artificial intelligence” techniques.
But most computer scientists, as opposed to marketing droids, no longer use the phrase “artificial intelligence” or its contraction “AI” because concepts of what it is keep on changing in the light of developments in computer science and investigations by biological scientists into how the human brain actually works. Moreover AI consists of a number of separate techniques, all with their own value but also limitations. It can include pattern recognition in images, the identification of rules in what initially appears to be random data, data mining, neural networks, and machine learning in which a program follows the behaviour of an individual or event and identifies patterns and linkages. There are more, and there are also many overlaps in definitions and concepts.
Much depends on what sort of results are hoped for. A scientist, operating in either the physical or the social sciences and possessed of large volumes of data, may wish to have drawn to their attention possible patterns from which rules can be derived. They may want to extend this into making predictions. A social media company or retailer may wish to scan the activity of a customer in order to make suggestions for future purchases – but here high levels of accuracy are not particularly required. If an intelligence agency or law enforcement agency uses similar techniques to scan the activities of an individual, any inaccuracy may have unfortunate consequences – whether that person is prevented from boarding an aeroplane, whether they secure future employment, or whether they are arrested.
If one is scrutinising uploaded files, limitations become apparent. In the first place the context in which a file is being uploaded may be critical. Field Manuals from the United States Army[xi] were produced as part of the training mechanism for that organisation, but they are also found on the computers of people suspected of terrorism. Terrorist manuals may be reproduced on research and academic websites on the basis that experts need to be able to refer to and analyse them. The same photo may appear on a site promoted by a terrorist group and on one run by a news organisation. Some sexually explicit photos may be justified in the context of medical and educational research – or law enforcement.
Beyond that, as we have already discussed, telling the difference between a document which merely advances an argument and one which incites may be beyond what is currently possible via AI. My favourite example of linguistic ambiguity is “I could murder an Indian”, which might mean no more than one person inviting another to a meal in an Indian restaurant. In terms of photos, how does one tell the difference between the depiction of a murderous terrorist act and a clip from a movie or computer game? AI can readily identify a swastika in an image – but is the photo historic, of Germany in the 1930s and during World War II, or a still from a more modern war movie, or is it on a website devoted to neo-Nazi antisemitism? How do you reliably distinguish a 16-year-old from an 18-year-old, and for all ethnicities? How does an AI system distinguish the artistic from the exploitative, or tell when in a sexual situation there is an absence of consent? What exactly is “fake news” and where are the generally accepted guidelines to recognise it?
The role of AI techniques, therefore, is less that they can make fully automated decisions of their own and more that they can provide alerts on which human monitors will make a final arbitration. Even here there is a problem because, as with most alert systems, a threshold has to be set before something is brought to attention. A balance has to be struck between too many false positives – alerts which identify harmless events – and false negatives – failures to identify harmful activity.
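A toy example, using entirely invented classifier scores, shows how moving that threshold trades one kind of error against the other:

```python
# Toy illustration with invented scores: each item is
# (score assigned by the automated system, whether it is actually harmful).
SCORED_ITEMS = [
    (0.95, True), (0.80, True), (0.60, False), (0.55, True),
    (0.40, False), (0.30, False), (0.20, True), (0.10, False),
]

def count_errors(threshold: float):
    """Count false positives and false negatives at a given alert threshold."""
    false_positives = sum(1 for score, harmful in SCORED_ITEMS if score >= threshold and not harmful)
    false_negatives = sum(1 for score, harmful in SCORED_ITEMS if score < threshold and harmful)
    return false_positives, false_negatives

for threshold in (0.2, 0.5, 0.8):
    fp, fn = count_errors(threshold)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Raise the threshold and fewer harmless items are flagged but more harmful material slips through; lower it and the human monitors drown in false alarms.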

Issue 5: the role and training of human monitors.
This takes us back to Issue 1. A human monitor has to make judgements based on criteria laid down by the organisation exercising blocking. That human monitor needs clear and consistent instructions and, associated with them, appropriate training. Among other things the blocking organisation will want to be able to demonstrate consistency in decisions. As we have seen, monitoring for illegality is easier than making judgements about “extremism” and “harm”. But even here the structure of many laws is that it is for a court to determine whether a crime has been committed. Where the test is purely of a factual nature – for example the age of a person in a sexual situation – the decision might be relatively simple. But where somebody is to be convicted for disseminating terrorist material, context may be critical – the academic researcher versus someone against whom there is also evidence of having sent funds or of having begun to accumulate the material necessary to build a bomb.
As a result the human monitor can probably only block where they are absolutely sure that a court would convict – leaving a number of potential situations in which a court might possibly convict but the monitor decides that there is insufficient reason to block. At the Internet Watch Foundation, which operates on a relatively limited remit confined to illegal sexual material, decisions about marginal photos and files are usually taken by more than one person and may be referred upwards for special review.
One policy problem in the counter-terrorism domain is that material which by itself is not illegal may nevertheless play a part in the radicalisation of an individual. A striking recent example was a BBC drama based on events involving child abuse in the northern town of Rochdale, which was said to have inspired a man to murder a Muslim man and attack others in Finsbury Park, London.
Where are we to obtain appropriate human monitors? Facebook and similar organisations have announced that they plan to recruit 10,000 or more such persons. But there is no obvious source – this is not a role which exists in employment exchanges or in the universities. Almost inevitably a monitor will spend most of their day looking at deeply unpleasant and distressing material – even if you can persuade people to assume such a role, it is plainly important to establish that they have the intellectual ability and psychological make-up to be able to cope and perform. Current indications are that monitors are recruited in countries that possess a population of graduates but where regular employment for them is very limited and hourly rates are low. It also looks as though the monitors are not directly employed by the social media sites but by third-party out-sourcing companies such as Accenture.[xii] If true, this could be aimed at limiting the liability of the major social media sites. Moreover, and again one looks at the experience of the Internet Watch Foundation, employers have a duty of care, as damage to the monitor, and to their effectiveness, may develop over time. One must also ask what sort of career progression such a monitor can expect.

Observations
Too often those who dislike what they see “on the Internet” spend all their energy in drawing attention to the various harms and neglect to consider in sufficient detail which remedies might have a practical impact. 
As this article has tried to show, criteria for blocking have to be clear and unambiguous whether the blocking is carried out by human monitors, computer programs or a combination thereof. There will always be a substantial territory at the margins where there are disputes.
Fully automated computer-mediated blocking is high risk because AI is nowhere near sufficiently sophisticated to achieve results which most people will accept. There is a useful mantra: Blocking is good and censorship is bad.

So given that obvious harms exist on the Internet:  what practical routes are available now?
One of them, popular with campaigners, is to emulate Germany and its Netzwerkdurchsetzungsgesetz – NetzDG for short. This requires the biggest social networks – those with more than two million German users – to take down “blatantly illegal” material within 24 hours of it being reported. For less obvious material, seven days’ consideration is allowed. Fines for violation can be up to 50 million euros. At the time of writing there have been no cases. But this law seems to be limited to situations where there is existing law describing illegality, not to the further categories of extremism and harm.

There are a number of existing UK laws which address situations falling short of full-on sexual and terrorism offences, for example the sending by an adult of a sexually explicit picture to a child and the various preparatory terrorist activities in the Terrorism Act 2006 – “encouragement”, dissemination of materials, raising funds, and arranging and attending training events.

The NSPCC proposes a Code of Practice which it says should be mandatory[xiii], but many of their detailed proposals lack the specificity which is required if there is to be legal enforcement – “safeguarding children effectively – including preventative measures to protect children from abuse” is simply the articulation of a desirable policy aim. However, there is much to be said for campaigning for a voluntary code, violation of which would be an opportunity for public shaming.

This takes us to a proposal which is in some respects contentious but which merits further examination: much higher personal identity verification standards before admitting people to accounts on social media. This would involve processes similar to those required in opening an online bank account – birth certificates, passports, possibly trusted individuals being asked to sign off on someone’s identity. Such an approach would do much to prevent under-age individuals from joining unsuitable services and stop others from seeking to post anonymously or via a fake identity. Just as gun laws do not wholly stop the circulation of illegal firearms, such measures would reduce though not eliminate grooming, hate speech and fake news. At the least, higher personal identity verification standards would make it much easier to identify fake identities and accounts which are bots as opposed to real people. But there will be opposition from privacy advocates who will argue that in some countries dissent is difficult to publish unless there is anonymity.

But higher personal identity verification standards would have to be imposed globally and not just in the UK in order to close off obvious evasion routes – and both the public and the major social media sites would need to be persuaded that the advantages outweigh the loss of convenience and privacy. 





[i] s 160 Criminal Justice Act 1988
[ii] https://www.sentencingcouncil.org.uk/offences/item/possession-of-indecent-photograph-of-child-indecent-photographs-of-children/
[iii] ss 63–67 Criminal Justice and Immigration Act 2008
[iv] https://www.theguardian.com/uk-news/2017/sep/17/paralysis-at-the-heart-of-uk-counter-extremism-policy
[v] Indeed under s 67 Serious Crime Act 2015 it is an offence for an adult to send a sexually explicit message to a child
[vi] See for example:  https://inforrm.org/2017/11/12/cjeu-advocate-general-opines-on-the-definition-of-a-data-controller-applicable-national-law-and-jurisdiction-under-data-protection-law-henry-pearce/
[vii] http://constitutionus.com/; https://www.law.cornell.edu/constitution/first_amendment
[viii] Cited by https://zephoria.com/top-15-valuable-facebook-statistics/ though there are other statistics and it is difficult to know which to credit.
[ix] Such as MD5 or one of the SHA family of algorithms
[x] https://www.microsoft.com/en-us/photodna;  https://en.wikipedia.org/wiki/PhotoDNA
[xi] https://www.loc.gov/rr/frd/Military_Law/pamphlets_manuals.html
[xii] https://www.thetimes.co.uk/article/facebook-fails-to-delete-hate-speech-and-racism-hwrzw0qzn; https://www.thetimes.co.uk/article/meet-the-internet-moderators-b86t2lrlv; https://www.washingtonpost.com/news/the-intersect/wp/2017/05/04/the-work-of-monitoring-violence-online-can-cause-real-trauma-and-facebook-is-hiring/?utm_term=.4d0a47b56d12; https://www.wsj.com/articles/the-worst-job-in-technology-staring-at-human-depravity-to-keep-it-off-facebook-1514398398; http://www.dailymail.co.uk/news/article-4548898/Facebook-young-Filipino-terror-related-material-Manchester.html
[xiii] https://www.nspcc.org.uk/what-we-do/news-opinion/more-than-1300-cases-sexual-communication-with-child-recorded-after-change-law/