Global Internet Monitoring Project
1. Who are the users or target customers of your project, and what have you learned from them so far?
Our audience is very broad, but can be split into three categories: 1) researchers and data analysts, 2) programmers and developers, and 3) journalists, policy makers and the general public. The boundaries that delineate these categories roughly correspond to the level of granularity of the datasets to be consumed.
Researchers, analysts: raw data.
The raw, unfiltered data is most likely to be useful for individuals and groups working in research. They may have very specific requirements, such as the need to analyze a precise network event observed in a set of countries within given time intervals. Conversely, the raw data should also allow for original experiments pertinent to a researcher's own field. For example, social scientists might attempt to correlate data from the Global Internet Monitoring Project with sentiment analysis results from prior research. Chokepoint Project and the Tor Project will release the raw data in a bulk format as it becomes available. We will use this input to feed the Application Programming Interfaces (APIs) described below.
Programmers, developers: structured data, APIs.
Programmers, developers and other groups building tools that require structured data make up the second target user group. APIs are abstraction layers meant to remove most of the processing required to generate meaningful data for a given use case. This data will be predictable in its format, periodically updated, clearly organized and documented according to publicly available specifications. Typical use cases for this level of data granularity would be a resource endpoint providing data about blocked website domains in a country, or a list of backends that appear to manipulate headers sent from our probes. Chokepoint and Tor will leverage the raw data gathered to create simple APIs for potential data consumers.
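As an illustration of the "blocked website domains in a country" use case, the following is a minimal sketch of what the body of such a resource could look like. The record fields, the country codes "AA" and "BB", and the function name are assumptions made for this example only; they are not a published schema of either project.

```python
import json

# Hypothetical, already-analyzed measurement records. The field names are
# assumptions for illustration, not the project's published data format.
MEASUREMENTS = [
    {"country": "AA", "domain": "example.org", "blocked": True},
    {"country": "AA", "domain": "example.com", "blocked": False},
    {"country": "BB", "domain": "example.org", "blocked": False},
    {"country": "AA", "domain": "example.net", "blocked": True},
]


def blocked_domains(measurements, country):
    """Build the response body for a hypothetical blocked-domains resource."""
    # Keep only domains flagged as blocked in the requested country,
    # de-duplicated and sorted so the output is predictable.
    domains = sorted({m["domain"] for m in measurements
                      if m["country"] == country and m["blocked"]})
    return {"country": country, "blocked_domains": domains, "count": len(domains)}


if __name__ == "__main__":
    # Serialize with sorted keys so consumers always see the same layout.
    print(json.dumps(blocked_domains(MEASUREMENTS, "AA"), sort_keys=True))
```

A predictable, sorted response of this kind is what lets downstream tools consume the data without repeating the raw-data processing themselves.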
General public, journalists, policy makers: data applications.
By creating compelling visualizations, condensed reports and rich diagrams from API data, one can communicate a strong message. These tools are popular among journalists looking for effective ways to contextualize facts. Policy makers can make more informed decisions if they are presented with solid data in an accessible form. The general public should also have access to these tools and be able to contribute by easily creating their own. Chokepoint and Tor will lead by example by creating sample applications that demonstrate the use of our structured data.
An important lesson we have learned from our target groups is the importance of involvement at each stage of the data publication process, providing useful tools along the way. It is also quite common to see scientists, journalists or researchers produce papers and articles without releasing the raw data alongside them. In this respect, we hope to set an example by exposing our methodology and raw data to public scrutiny, in the hope of providing what could be considered a "best practice" in the field of internet monitoring data analysis.
2. What assumptions are you making in what you propose, and how will you test them?
The core approach of this project is NOT to assume, but to test. That said, we make the following fundamental assumptions:
A) The events the probes test for do in fact take place in certain countries (based on previous reports, projects, etc.).
B) The events to be tested will change over time.
C) These events are detrimental to free speech, net neutrality, equal access to information and the like.
D) Gathering this test data in an ongoing fashion will provide hard facts as to the technical state of affairs.
E) Public dissemination of this data, both in its raw format and in digested form, will raise awareness and positively impact a large number of people.
F) Understanding both the means and the content of interference will improve mitigation of this interference.
More technically, the following can be said:
A) Different network tests have different assumptions. For this reason, test results always need to be cross-referenced with results from a network vantage point that does not perform network filtering.
B) There is a wide range of tests. Should the underlying assumption of one prove false, others are likely to compensate, so that collectively actionable information will still be generated.
C) Disseminating probes will require supporting those in the best position to run these probes.
D) Analysis of generated raw data will not happen by itself.
E) Digested results have to be presented at a level of understanding appropriate to the target audience.
Test-specific assumptions for the various ooni-probe tests can be found here: https://github.com/TheTorProject/ooni-spec/tree/master/test-specs.
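The cross-referencing described in technical assumption A can be sketched as a comparison between a probe's results and those from an unfiltered control vantage point. This is a simplified illustration, not the ooni-probe implementation: the outcome strings "ok" and "error", the label names and the function name are assumptions made for the example. In line with the project's caution about causality, the sketch only flags anomalies and does not claim to identify censorship.

```python
def compare_with_control(probe_results, control_results):
    """Label each tested input by comparing probe and control outcomes.

    Both arguments map an input (for example a domain name) to a
    hypothetical outcome string such as "ok" or "error".
    """
    labels = {}
    for target, probe_outcome in probe_results.items():
        control_outcome = control_results.get(target)
        if control_outcome is None:
            labels[target] = "no_control"    # nothing to compare against
        elif probe_outcome == control_outcome:
            labels[target] = "consistent"    # both vantage points agree
        elif control_outcome == "ok":
            labels[target] = "anomaly"       # control succeeded, probe differs
        else:
            labels[target] = "inconclusive"  # the control itself failed
    return labels


if __name__ == "__main__":
    probe = {"example.org": "error", "example.com": "ok", "example.net": "ok"}
    control = {"example.org": "ok", "example.com": "ok"}
    print(compare_with_control(probe, control))
```

An "anomaly" label is only a prompt for further analysis: a network fault on the probe side produces the same label as deliberate interference.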
3. How will you get your project in front of the necessary people or organizations?
By:
A) Leveraging our existing network of people and organizations. This network covers a wide area of interest and expertise and corresponds to the intended audiences, including media, human rights defenders, free speech advocates, industry, academia and policy makers.
B) Coordinating a media campaign to promote the tool, both by using existing publicity channels such as the Tor blog and by soliciting participation from the aforementioned network of existing contacts.
4. What are the obstacles to implementing your idea, and how will you address them?
A) Getting people from the countries we are interested in to run the tool.
Our approach to this issue is to perform outreach and provide significant support to "on-the-ground" partner organizations. This should be made somewhat easier by "productizing" the probes on Raspberry Pi devices, thereby reducing the technical skill and time needed to run a probe.
B) Having relevant inputs to the probes, such as country specific domain lists.
Some very good work addressing this issue has recently been made available by Citizen Lab: https://github.com/citizenlab.
In addition to this, we will distribute a survey to our on-the-ground partners to further expand the available input sets. Unfortunately, any input set will have a limited lifespan. Addressing this longevity issue is not a focus at this time, but the problem is known.
C) Automated analyses are limited in that it is very difficult to determine causality.
There will be no attempts at determining causality, and any statement of fact resulting from such automated analyses will be treated with the utmost suspicion.
D) Visualizations are inherently biased; their narrative power risks presenting a false reality.
As with the analytic data underpinning these visualizations, statements of fact should be treated with the greatest suspicion. Our approach to mitigating this issue is to be very conservative with regard to what narrative might be read into any visual representation of the data analytics.
E) Data processing, infrastructure, security and methodologies.
Once the probes have generated reports and sent them back for analysis (raw data), most expected challenges will relate to the handling of that data. Known obstacles such as data transport, data security, data publishing methods, data anonymization and data processing fall within the areas of expertise of both the Chokepoint and Tor project teams. Our experience in building reliable, secure systems that handle data at scale will be leveraged to design pipelines that coordinate data traffic. We have extensive experience in delivering production systems, from conceptualizing high-level interactions between different moving parts to implementing the code that makes them run smoothly.
5. How much do you think your project will cost, and what are the major expenses?
Based on a breakdown of activities and expenses, the project is estimated to cost $402856.
This breaks down as follows:
Analysis & development: 252300
Project Management & Project Support: 34250
System Administration: 26500
The resources break down as follows:
100 Raspberry Pi devices: 55200
Legal support: 10000
Partner support (10x2000): 20000
Travel & stay (10x1500): 15000
Server HW or service equivalent (4x3500): 14000
Incidentals (50x100): 5000
Bandwidth (24x700): 16800
A more detailed budget and corresponding budget rationale is available and represents an "all features" effort.
6. What other people or projects are working in this space, and what have you learned from them?
This space has become joyfully crowded and any listing would not do it justice. That having been said:
There are various projects that focus more on network neutrality in general. Examples are NeuBot, developed by the Nexa Center, Glasnost by the Max Planck Institute, and Project BISMark from Georgia Tech.
We are currently in contact with, and collaborating on some projects with, the Nexa Center and Georgia Tech. Their experience in this field has proven useful in understanding what good deployment strategies are. For example, Project BISMark's idea of giving out home routers to people interested in contributing results (and their success with this strategy) is the basis of our plan to ship Raspberry Pi devices to potential ooni-probe users.
More specifically aimed at internet censorship measurement are ONI and Herdict by the Berkman Center. We have learned from these projects that it is important to provide the raw data of the measurement results, so that other people can base their analyses on this work. The ONI project, which aimed at defining standards for government content removal, has also taught us that it is very important to have a well-specified standard data format.
The Oxford Internet Institute has done, and is still doing, good work on external probing of DNS poisoning. Working with them has helped us understand the differences between, and respective limitations of, manual and automated analyses, as well as the benefits of ongoing bulk data for further research.
Measurement Lab provides a fantastic infrastructural platform to run tests against and publish raw data. It is clear that as uptake increases, infrastructural requirements and problems in processing analytics results grow exponentially.
GreatFire has successfully focused on China and provides daily updates both about full domain blockages and Weibo content censorship. It provides a very good example that the technical means of censorship are increasingly sophisticated, further strengthening our conviction in the importance of ongoing, eventually global, and publicly available measurements.
In general, there is a lot of excellent work being done by many great people and organizations, many of whom we count ourselves lucky to have as friends. Each has a slightly different target audience, which helps limit the problem set to be solved.
With the rapid growth of censorship and surveillance practices that directly or indirectly violate civil and human rights, it has become vitally important to augment our incidental and anecdotal understanding of these practices with ongoing, evidence-based reporting on what is actually happening on our networks. Achieving this requires a globally distributed network of standardized network measurement nodes, as well as powerful analysis and visualization tools.
We, the Tor Project and Chokepoint Project, have over the past two years amassed extensive technical and domain-specific expertise in the detection, analysis and reporting of surveillance and censorship events. The Tor Project has been developing open standards, software and a methodology for conducting measurements. Chokepoint Project has been working on near real-time processing, analysis, visualization and contextualization of this type of data.
For this proposal, we aim to extend, improve and integrate the existing software systems and analysis tools, with the goal of enabling more comprehensive, evidence-based, and up-to-date reporting on censorship and surveillance events. Our proposal works towards this goal with a three-pronged approach:
1. Expand and improve Tor's ooni-probe software suite, which provides the basic infrastructure to support a globally distributed measurement network.
- Support for running ooniprobe on Raspberry Pi devices.
- Running tests periodically, making ooniprobe a system daemon.
- Support for remotely provisioning probes with tests and inputs to run based on their geographical location and ASN.
2. Integrate and enhance Chokepoint's data analysis and visualization tools, to incorporate and report on data from the ooniprobe software suite.
- Automated processing of ooniprobe yaml reports.
- Automated analysis of ooniprobe yaml reports.
- Automated collection of ooniprobe yaml reports.
- Support for automated generation of analytics visualization and analytic data downloads.
3. Reach out to Tor's and Chokepoint's extensive list of contacts to plan the deployment of ooniprobes "on the ground", in a selected set of 10 to 20 countries.
- Survey creation and distribution to determine country specific internet use
- User feedback features
- Training material
- Plan for software distribution
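The remote provisioning item in point 1 above (tests and inputs chosen by geographical location and ASN) can be sketched as a lookup with fallbacks. This is an illustration under stated assumptions, not the ooniprobe provisioning design: the table layout, test names, input file names and the function name are placeholders, "AA" is a placeholder country code, and AS64500 comes from the ASN range reserved for documentation.

```python
# Hypothetical provisioning table keyed by (country code, ASN).
# An ASN of None acts as a wildcard meaning "any network in this country".
DECKS = {
    ("AA", "AS64500"): {"tests": ["dns_consistency", "http_requests"],
                        "inputs": "aa-as64500.txt"},
    ("AA", None): {"tests": ["http_requests"], "inputs": "aa.txt"},
}

# Used when nothing more specific is known about the probe's location.
DEFAULT_DECK = {"tests": ["http_requests"], "inputs": "global.txt"}


def select_deck(country, asn):
    """Return the most specific set of tests and inputs for a probe."""
    # Try the exact (country, ASN) match first, then the country-wide entry.
    for key in ((country, asn), (country, None)):
        if key in DECKS:
            return DECKS[key]
    return DEFAULT_DECK


if __name__ == "__main__":
    print(select_deck("AA", "AS64500"))  # network-specific entry
    print(select_deck("AA", "AS64501"))  # country-wide fallback
    print(select_deck("BB", "AS64502"))  # global default
```

Falling back from network to country to a global default means a newly deployed probe always receives something to run, even before country-specific input lists exist for its location.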
Since no two countries are alike, and internet use is equally diverse, any measurement needs to be placed in its regional socio-political context. Surveys will be distributed to on-the-ground partner organizations to construct a measurement methodology that yields culturally relevant results.