Thursday, April 21, 2011

Long-term access to information

One of the most famous quotes of our times is "Information is power". It doesn't take a theoretical computer scientist to tell you that, yet many were required to formalize it in something called "space complexity", and many are still researching related issues. It is intuitive that having more storage, and thus being able to store larger amounts of information, allows you to solve more problems. Computer scientists have actually proven it: increasing the amount of tape by more than a constant factor allows you to solve at least one problem that you could not solve before. For readers who are not familiar with space complexity but wish to know more, "space complexity", "PSPACE" and "space hierarchy theorem" are good search terms to begin with.
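For the curious, the result alluded to above is usually stated roughly as follows. This is the standard textbook form of the space hierarchy theorem rather than anything specific to this post; it assumes that f is a space-constructible function with f(n) at least log n, and the \subsetneq symbol needs the amssymb package.

```latex
\[
  \mathrm{SPACE}\bigl(o(f(n))\bigr) \subsetneq \mathrm{SPACE}\bigl(f(n)\bigr)
\]
```

In words: there is at least one problem that can be decided using on the order of f(n) tape cells but cannot be decided using asymptotically fewer.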

One can project the theory and notions of space complexity onto people in their working environment. The more storage space you are able to access, the better. Imagine if you could only have 10 pages of text stored on your computer at a time. This would greatly hinder your ability to cross-reference and validate information. However, while a Turing machine can access its tape at any time, the same is not true of you and your documents. Allow me to explain with a short story.

Despite my young age, I have been using computers for two decades now. Back then, I needed my dad's help to access my data and applications (prehistoric video games!): he handed me the correct floppy disk and typed the necessary commands, at least until I had watched him enough times to do it myself. And yes, I typed my first computer command long before writing my first sentence, but I didn't have the remotest clue what I was doing. I simply knew that I had to press the same symbols as those written on the paper stuck to the side of the screen in order to play my favorite video game. This memory is very exciting for me because it reminds me of how Turing machines do not know what they compute, yet they compute it. It becomes even more exciting when I remember that, as I kept using these commands and observing the system's response, I started to realize what I was doing. The first command I mastered that way was "dir". It listed all the possible arguments of the "cd" command, which allowed me to access different games. And so on. This resembles unsupervised learning, where a computer program learns something by studying a lot of examples. I am already writing a post comparing human and machine learning.

Philosophical discussion of computation aside, there is an important aspect to the above story. Just as I needed my father's intervention to access the floppies, you need an application that can comprehend the format your data is encoded in. If for some reason you do not have access to an application with that ability, your data is useless to you. Without someone who knows how to read a language (an encoding in this case), any text in that language is very hard or impossible to read. Perhaps you've tried to open a Microsoft Office document (text, presentation or other) in another office suite and the results were unsatisfactory? Most likely it wasn't that the developers were not good enough, but that they were damn good to have managed to open that document at all, even if only partially, by using reverse engineering. This task is comparable to decrypting old military codes. Having a lot of examples, you try technique after technique to open a document until you get close to the desired result. How close you get is determined by the obscurity of the encoding and the number of techniques you try.
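The point about encodings can be seen on a very small scale with plain text. The following is a minimal sketch, not a model of any real office format: the sample string and the helper name read_document are made up for illustration. It stores a short Greek phrase as bytes and then reads those same bytes back with three different sets of "reading rules".

```python
# The "document": text stored on disk as raw bytes (sample text is made up).
original = "Ανοιχτά πρότυπα"  # "Open standards" in Greek
stored_bytes = original.encode("utf-8")


def read_document(data: bytes, encoding: str) -> str:
    """Interpret raw bytes according to the given encoding rules."""
    return data.decode(encoding)


# A reader that knows the right rules recovers the text exactly.
recovered = read_document(stored_bytes, "utf-8")

# A reader that assumes the wrong rules still produces characters,
# but they are gibberish: every byte is mapped to some Latin-1 character.
garbled = read_document(stored_bytes, "latin-1")

# A stricter reader with the wrong rules gives up entirely.
try:
    read_document(stored_bytes, "ascii")
    ascii_readable = True
except UnicodeDecodeError:
    ascii_readable = False

print(recovered == original)  # the right decoder gets the text back
print(garbled)                # the wrong decoder yields mojibake
print(ascii_readable)         # the strict wrong decoder fails outright
```

The bytes never change; only the knowledge of how to interpret them does. With a text encoding the rules are published, so guessing is unnecessary. With an undocumented file format, the reader has to reconstruct those rules by trial and error.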


However, file formats, which is what the encoding rules for a certain kind of document are called, are meant to be directions for writing and reading a document, not for encrypting it. Encryption should be independent of the encoding, since the file format is equivalent to a language's grammar and syntax, not to the cipher or technique you would use to encrypt a document. Recognizing users' need to be able to read and edit their documents regardless of their choice of application and operating system, certain software vendors have developed open file formats. These are file formats with publicly available specifications, which anyone interested can help develop. Usually, an alliance of vendors together with independent programmers develops the format, then releases a number of implementations that any vendor can use in their applications to read and edit the file format. An open file format is an open standard. You can read a brief definition of open standards here: http://documentfreedom.org/2011/os.en.html

Open file formats include:
  • The ODF (Open Document Format) family, which serves typical office suite documents. OpenOffice.org and LibreOffice are among the applications that natively support ODF. Microsoft Office has supported ODF since Office 2007 Service Pack 2 (2009). I recommend you use ODF instead of Microsoft's Office Open XML format, which is, contrary to popular belief, closed. While its specification is publicly available, Microsoft holds patents on the format and may make demands on developers who implement it. IBM and Sun Microsystems, on the other hand, have waived such rights for ODF, making it a fully open standard.
  • HTML, XML, RSS and other formats used on the Web and the Internet. Openness is an important factor in the growth and development of the Web. And while we are at it, the TCP and IP protocols are open standards as well.
  • SVG and PNG, used to store images.
  • PostScript, LaTeX, DVI and PDF. Adobe opened the PDF format back in 2008. If only the same were done with Flash. Oh well, HTML5 is around the corner. You can already try HTML5 on some YouTube videos by visiting http://www.youtube.com/html5 with the latest version of your browser (I recommend Firefox 4).
  • Unicode, UTF-8, ASCII and others.
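To make "publicly available specification" concrete: an ODF document is a ZIP package whose main text lives in an XML file named content.xml, so a few lines of standard-library code are enough to get the paragraphs out. The sketch below is only an illustration under stated assumptions. The helper names extract_paragraphs and make_sample_package are hypothetical, and the sample package is a stripped-down stand-in built in memory; a real .odt file also contains a manifest, styles and metadata.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# XML namespaces defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(odt_file) -> list[str]:
    """Return the text of each <text:p> paragraph found in content.xml."""
    with zipfile.ZipFile(odt_file) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_package(paragraphs) -> io.BytesIO:
    """Build a stripped-down stand-in for an .odt file (illustration only)."""
    body = "".join(f"<text:p>{escape(p)}</text:p>" for p in paragraphs)
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text>{body}</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", content)
    buffer.seek(0)
    return buffer


sample = make_sample_package(["Open formats", "can be read by anyone."])
paragraphs = extract_paragraphs(sample)
print(paragraphs)
```

Nothing here depends on a particular office suite or vendor. That is the practical meaning of an open format: anyone can write a reader from the published rules instead of reverse engineering them.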

As you can see, open standards are not an obscure idea among computer users that, while correct, has seen little adoption (yes, Linux, I am looking at you). Hundreds of major companies support them (see http://www.odfalliance.org/), and they are not tied to a certain business model for software (proprietary/free/open) or to a specific operating system. It is an idea as essential as the invention of typography and will probably play a major role in the future, as digital documents become the norm. In the Middle Ages, only rich people could afford books, which were expensive copies prepared by monks; typography allowed anyone to read them. In the same way, open standards will allow anyone, regardless of preferences and budget, to use digital documents and participate equally in the digital community and e-government. Your to-do list written in a notepad application may not be worth archiving, but state laws, historic documents and other important files are. Those are documents that we want to be able to access decades from now, regardless of applications and vendors. Furthermore, you want to be able to access your personal documents in the years to come, especially if they are work-related or have special emotional value.

I considered this issue important enough that I organized an event and a presentation, along with my friend and colleague George. We presented it at the ACM AUTH Student Chapter; you can find the presentation (in Greek) here: https://docs.google.com/present/edit?id=0AVbfU6r_pWRdZGZ0NDZ4OWpfNGhrczVmdGQ3&hl=en&authkey=CO629-II

I hope I have shown how important open standards and open documents are. I urge all of you, regardless of your philosophy on software, to use open standards.