Hypothesis: Precisely define a high-level logic of interscreen interaction, and maybe a graphical implementation for it will somehow fall into place, especially if the high-level logic can be grafted easily onto already widely used open data schemes (RSS? social graphs? etc.).
The selection of a precise point or line on a screen, by clicking or dragging, implies that 1) the person has focused attention disproportionately on that area, and 2) some effect will happen connected to that particular area.
Beyond some primitive level of development of graphical interfaces, no two screenshots seem likely ever to display the same information and options.
So how can we dynamically and digitally represent the ever-changing screen information to the cloud, and the ever-changing cloud to the screen?
We recognize that any given piece of information will mean different things to different people. We'll want the cloud to draw various inferences from our input and to distribute the information intelligently.
My expectations for the effects of my clicks and drags include complex factors of social interaction, when the interface on which I'm clicking and dragging integrates tightly into the cloud. I imagine such interfaces allowing for new modes of interaction with millions of minds, with vastly accelerated rates of feedback, such that we actually feel the reality of the dynamic existence of these millions of other people, from second to second, in new ways. I imagine that looking back from this perspective, the resolution of older media such as television, non-interactive videos, telephone conversations, etc. will appear nearly intolerably low.
Most socially useful software development has become increasingly open and collaborative.
Social discourse
Software development and much of our political, commercial, and legal discourse, when such discourse is organized around a few specific terms and negotiations over the terms' definitions and/or values (interest rates, legal definitions, legal rulings, negotiated legal/political settlements, etc.), seem like good early applications of the new type of interface. If these areas can be successfully absorbed into the new medium, vast amounts of drudgery, carried out by specialists, will evaporate as millions worldwide collaborate with comparatively awesome efficiency to solve new abstract problems arising in the cloud. Of course, institutions with long legacies of doing business through previous media, and with secretive habits, will probably not be among the first to adopt the new, open-to-the-world interface. But if it begins to fulfill its revolutionary promise among early adopters, the legacy systems will obviously integrate into it, as they have with the Web, with Twitter, etc. And since the new interface potentially subsumes and integrates all functions of current GUI software, it can be expected to virally infect and take over any abstract symbolic function of any existing institution.
Thousands of internet forums seem to prove the socio-moral-psychological viability of the idea of crowdsourcing the larger part of our abstract technical, commercial, legal, and political work. (This essay seeks to demonstrate its technical viability.) People at stackoverflow.com constantly hand out valuable technical advice to anyone who can coherently formulate a question. People seem to love to help, even with relatively boring problems, especially in noncoercive high-feedback environments where the impact of the help can be clearly felt and seen.
It seems helpful to face squarely at this juncture the question: what do we mean by clicks and drags? It seems beneficial to assume that the interface will take into account only the beginning and ending positions of drags, not the path between them. (This assumption carries over from most software we currently use.) With that assumption, we seem to have a fairly complete [complete enough? (for?)] structural picture of the basic input and output actions: a person looks at a screen, which displays pixels with occasionally changing colors, moves a pointer, or moves the contents of the screen over/around a pointer, and occasionally clicks or drags. The clicks may be brief enough that the hardware does not register the exact duration, but just associates the selection with a particular point or pixel on the screen at a particular time. A drag will involve two such associations, at the beginning and the end.
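As a minimal sketch of the input model just described, assuming Python and hypothetical names (`Selection`, `make_selection`) that are not part of any existing system, a click can be stored as one point-and-time association and a drag as two, with the path between them discarded:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]            # (x, y) pixel position on the screen
Event = Tuple[int, int, float]     # (x, y, time in seconds); hypothetical format


@dataclass(frozen=True)
class Selection:
    """One click or drag, reduced to its beginning and ending associations."""
    start: Point
    start_time: float
    end: Optional[Point] = None        # None for a plain click
    end_time: Optional[float] = None   # None for a plain click

    @property
    def is_drag(self) -> bool:
        return self.end is not None


def make_selection(down: Event, up: Event) -> Selection:
    """Build a Selection from a press event and a release event.

    Only the two endpoints are kept; whatever path the pointer took
    between them is ignored, as assumed in the text.
    """
    start = (down[0], down[1])
    end = (up[0], up[1])
    if start == end:
        # Pressed and released on the same pixel: treat as a click and
        # do not record the duration.
        return Selection(start=start, start_time=down[2])
    return Selection(start=start, start_time=down[2], end=end, end_time=up[2])


click = make_selection((10, 20, 1.0), (10, 20, 1.1))
drag = make_selection((10, 20, 2.0), (50, 60, 2.5))
```

Treating "same pixel on press and release" as a click is an assumption made here for simplicity; a real interface might instead use a small distance or time threshold.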
Then we can ask how to program each interface so that the cloud as a whole intelligently distributes pixel colors. Defining "intelligently" here may at first appear to be the prohibitively difficult question. But we have already assumed that people are deliberately selecting certain portions of their screens -- which of course we have been doing for several years now, maybe often with some kind of dull or reckless attitude, but maybe also, at other times, with some kind of actual creativity or intelligence. We have been injecting more and more intelligence into the cloud, such that the cloud can now be made intelligent enough to share the intelligence in new, radically decentralized ways, spreading it optimally among the interfaces.
An interface will contain some kind of idea about the meanings of our selections, as it will have already applied some criteria in determining what to display where and when on the screen. In our second-to-second deliberations as we navigate through this highly interactive environment, we will quickly learn to take the reactions of others into account. So my expectations regarding the effect of making a given selection include considerations about 1) how it will affect my screen, and 2) how it will affect other people's screens.
[Perhaption X: Perhaps the interface/cloud can somehow, on some level, digitally encode any given "screenshot" by giving relative numerical values for the strength of the node's connections at that time to certain other nodes, or by simply listing a finite number of other nodes.]
As the screen pixels get dynamically patched together by the cloud, they will manifest various levels of "noise", with some more "fuzzy" areas where we can make out less definite meaning. Selecting such areas may indicate some meaningful connections between nearby patches containing greater internal coherence. The interface may 1) have already assigned different "fuzziness" values to different areas of the screen, so that it can decide which areas of coherence the person intends to indicate, or 2) calculate these values when the person makes a selection.
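The essay does not say how a "fuzziness" value would be computed. Purely as an illustration, and assuming (this is not from the text) that fuzziness can be approximated by the Shannon entropy of the colors inside a patch, a sketch in Python might look like this; `patch_fuzziness` is a hypothetical name:

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def patch_fuzziness(pixels: Sequence[Hashable]) -> float:
    """Return the Shannon entropy, in bits, of the colors in one patch.

    ASSUMPTION: entropy stands in for "fuzziness". A patch of one flat
    color scores 0; a patch whose pixels are spread evenly over many
    colors scores high. This measures color variety, not meaning.
    """
    total = len(pixels)
    if total == 0:
        return 0.0
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


flat = patch_fuzziness(["red", "red", "red", "red"])
mixed = patch_fuzziness(["red", "blue", "red", "blue"])
```

Either of the two options in the text fits this shape: the interface could precompute such a value for every patch, or call the function only for the patch around a selection when the selection is made.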
Perhaption X implies that whenever a person makes a selection, the node will recalculate its strength values and/or node list. But maybe instead of thinking about each "instant" of a given screen corresponding to a list of strength values or nodes, we can think about each selection as corresponding to such a list, and thus constituting a kind of "virtual node", a node not corresponding to a particular screen, but to a given choice made on a given screen.
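One way to picture such a virtual node, as a sketch only: a record of where and when the selection was made, plus a mapping from other node identifiers to connection strengths. The class name, field names, and the choice to normalize strengths so they sum to 1 are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class VirtualNode:
    """A node standing for one selection made on one screen (hypothetical)."""
    screen_id: str                 # which interface the selection was made on
    position: Tuple[int, int]      # pixel position of the selection
    timestamp: float               # when the selection was made
    connections: Dict[str, float] = field(default_factory=dict)

    def connect(self, other_id: str, strength: float) -> None:
        """Add strength to the connection with another node."""
        self.connections[other_id] = self.connections.get(other_id, 0.0) + strength

    def normalized(self) -> Dict[str, float]:
        """Return relative strengths that sum to 1 (empty if no connections)."""
        total = sum(self.connections.values())
        if total == 0:
            return {}
        return {node_id: s / total for node_id, s in self.connections.items()}


node = VirtualNode(screen_id="screen-1", position=(5, 5), timestamp=0.0)
node.connect("x", 3.0)
node.connect("y", 1.0)
relative = node.normalized()
```

The simpler variant mentioned in Perhaption X, a plain finite list of other nodes, corresponds to keeping only the keys of `connections`.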
Each virtual node connects to other virtual nodes. Here we may hopefully get into the crucial, tricky, precisely definable, recursive situation. Which other nodes does it connect to?
Let M represent a cluster of N nodes on a single user interface, in the center of which I make a selection. The interface will choose a value for N high enough to give valuable information about the selection, but low enough that it can handle the calculations, and will then look at the N nodes closest to the selection in formulating the weighted list of node connections for that selection (itself a virtual node). This list may include some of M or M in its entirety, and also remote nodes.
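The on-screen part of this step can be sketched as follows. The text does not specify how the weights are derived, so the inverse-distance weighting below is an assumption, as are the function name `weighted_connections` and the representation of nodes as a mapping from identifier to screen position. Remote nodes are left out of the sketch.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def weighted_connections(
    selection: Point, nodes: Dict[str, Point], n: int
) -> List[Tuple[str, float]]:
    """Return the n nodes closest to the selection with weights summing to 1.

    ASSUMPTION: a node's weight falls off as 1 / (1 + distance), so a node
    exactly under the selection gets the largest weight.
    """
    # Rank all on-screen nodes by Euclidean distance and keep the n closest.
    ranked = sorted(nodes.items(), key=lambda item: math.dist(selection, item[1]))[:n]
    raw = [(node_id, 1.0 / (1.0 + math.dist(selection, pos))) for node_id, pos in ranked]
    total = sum(weight for _, weight in raw)
    if total == 0:
        return []
    return [(node_id, weight / total) for node_id, weight in raw]


on_screen = {"a": (0, 0), "b": (3, 4), "c": (100, 100)}
result = weighted_connections((0, 0), on_screen, 2)
```

Sorting every node is fine for a small screen; an interface with very many nodes would presumably want a spatial index, which is exactly the "low enough that it can handle the calculations" trade-off on N.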
In order to generate M for the first time, before any selections within the new interface have been made, we will probably create tools that look at a person's blogs, email, social contacts through various social networks, online and offline bookmarks, hard disk contents, etc. These tools may prove useful for migrating our old information onto the cloud, but the new interfaces will then take over the generation and distribution of new content.
Assume that when people focus their attention on a particular area of the screen, they also poise their input device over/around that area, so increased attention given to an area correlates with increased probability of selecting it.