Thursday, April 21, 2011

Long-term access to information

One of the most famous quotes of our times is "Information is power". It doesn't take a theoretical computer scientist to tell you that, yet many were needed to formalize it in something called "space complexity", and many are still researching related issues. Although it is intuitive that having more storage, and thus being able to store larger amounts of information, allows you to solve more problems, computer scientists have proven that a more-than-constant increase in the amount of tape allows you to solve at least one problem that you could not solve before. For readers who are not familiar with space complexity but wish to know more, "space complexity", "PSPACE" and "space hierarchy theorem" are good search terms to begin with.
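
For readers who like a bit of notation, here is a rough, textbook-style sketch of the deterministic space hierarchy theorem; it is meant only to hint at the result, not as a formal treatment:

```latex
% Rough statement of the deterministic space hierarchy theorem.
% Assumes g is space-constructible and g(n) >= log n.
\[
  f(n) = o\big(g(n)\big)
  \;\Longrightarrow\;
  \mathrm{SPACE}\big(f(n)\big) \subsetneq \mathrm{SPACE}\big(g(n)\big)
\]
% In words: given asymptotically more tape, there is at least one problem
% you can now decide that you could not decide before.
```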

One can project the theory and notions of space complexity onto people in their working environment. The more storage space you are able to access, the better. Imagine if you could only have 10 pages of text stored in your computer at a time: this would greatly hinder your ability to cross-reference and validate information. However, while a Turing machine can access its tape at any time, the same is not always true for your documents. Allow me to explain with a short story.

Despite my young age, I have been using computers for two decades now. Back then, I needed to ask my dad to access my data and applications (prehistoric video games!): he would give me the correct floppy disk and type the necessary commands, at least until I had watched him enough times to do it myself. And yes, I typed my first computer command long before writing my first sentence, but I didn't have the remotest clue what I was doing. I simply knew that I had to press the same symbols as those written on the paper stuck to the side of the screen so that I could play my favorite video game. This memory is very exciting for me because it reminds me of how Turing machines do not know what they compute, yet they compute it. It becomes even more exciting when I remember that, as I kept using these commands and observing the system's response, I started to realize what I was doing. The first command I mastered that way was "dir". It listed the directories I could reach with the "cd" command, which in turn allowed me to access different games. And so on. This resembles unsupervised learning, where a computer program learns something by studying a lot of examples. I am already writing a post comparing human learning with machine learning.

Philosophical discussion of computation aside, there is an important aspect to the above story. Just as I needed my father's intervention to access the floppies, you need an application that can comprehend the format your data is encoded in. If for some reason you do not have access to an application with that ability, your data is useless to you. Without someone who knows how to read a language (an encoding, in this case), any text in that language is very hard or impossible to read. Perhaps you've tried to open a Microsoft Office document (text, presentation or other) in another office suite and the results were unsatisfactory? Most likely it wasn't that the developers were not good enough, but that they were so good they managed to open that document at all, even partially, by using reverse engineering. The task is comparable to breaking old military codes: given a lot of examples, you try enough techniques to open a document until you get close to the desired result. How close you get is determined by the obscurity of the encoding and the number of techniques you try.


However, file formats, which is what we call the encoding rules for a certain kind of document, are meant to be directions for writing and reading a document, not for encrypting it. Encryption should be independent of the encoding, since the file format is equivalent to a language's grammar and syntax, not to the cipher or technique you use to encrypt a document. Recognizing the user's need to read and edit their documents regardless of their choice of application and operating system, certain software vendors have developed open file formats. These are file formats with publicly available specifications, which anyone interested can help develop. Usually, an alliance of vendors together with independent programmers develops the format and then releases a number of implementations that any vendor can use in their applications to read and edit the format. An open file format is an open standard. You can read a brief definition of open standards here: http://documentfreedom.org/2011/os.en.html

Open file formats include:
  • The ODF (Open Document Format) family, which serves typical office suite documents. OpenOffice and LibreOffice are among the applications that natively support ODF, and Microsoft Office has supported ODF since 2010. I recommend you use ODF instead of Microsoft's Office Open XML format, which is, contrary to popular belief, closed. While it is publicly available, Microsoft holds patents on the format and may make demands on developers who implement it. On the other hand, IBM and Sun Microsystems have waived any such rights, making ODF a fully open standard.
  • HTML, XML, PHP, RSS and other formats and languages used on the Web and the Internet; openness has been an important factor in the growth and development of the Web. And while we are at it, the TCP and IP protocols are open standards as well.
  • SVG and PNG, used to store images (see the short parsing sketch after this list).
  • PostScript, LaTeX, DVI and PDF. Adobe opened the PDF format back in 2008. If only the same had been done with Flash. Oh well, HTML5 is around the corner: you can already watch some YouTube videos with HTML5 by visiting http://www.youtube.com/html5, provided you have a recent browser (I recommend Firefox 4).
  • Unicode, UTF-8, ASCII and others.
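
To illustrate the practical benefit, here is a minimal Python sketch (my own example, with a made-up file name) that inspects an SVG image using nothing but the standard library; because SVG is plain XML with a published specification, no vendor-specific software is needed:

```python
import xml.etree.ElementTree as ET

# "drawing.svg" is a placeholder file name for this example.
# SVG is ordinary XML, so the standard library can read it directly.
tree = ET.parse("drawing.svg")
root = tree.getroot()

# Print every element (circles, paths, rectangles, ...) and its attributes.
for element in root.iter():
    tag = element.tag.split("}")[-1]  # drop the XML namespace prefix, if any
    print(tag, element.attrib)
```

The same openness is what lets independent office suites implement ODF faithfully: the rules are published, so no reverse engineering is required.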

As you can see, open standards are not an obscure idea of computer users that, while correct, has seen little adoption (yes, Linux, I am looking at you). Hundreds of major companies support them (see http://www.odfalliance.org/), and they are not tied to a certain business model for software (proprietary/free/open) or a specific operating system. It is an idea as essential as the invention of the printing press, and it will probably play a major role in the future as digital documents become the norm. Just as the printing press allowed anyone to read books that, in the Middle Ages, only the rich who could buy expensive copies prepared by monks could afford, open standards will allow anyone, regardless of their preferences and budget, to use digital documents and participate equally in the digital community and in e-government. Your to-do list written in a notepad application may not be worth archiving, but state laws, historic documents and other important files are. Those are documents that we want to be able to access decades from now, regardless of applications and vendors. Furthermore, you want to be able to access your personal documents in the years to come, especially if they are work related or carry special emotional value.

I considered this issue important enough that I organized an event and a presentation along with my friend and colleague George. We presented it at the ACM AUTH Student Chapter; you can find the presentation (in Greek) here: https://docs.google.com/present/edit?id=0AVbfU6r_pWRdZGZ0NDZ4OWpfNGhrczVmdGQ3&hl=en&authkey=CO629-II

I hope I have conveyed how important open standards and open documents are. I urge all of you, regardless of your philosophy on software, to use open standards.

Sunday, March 6, 2011

Networks and unnecessary complexity

Computer networks are one of the most important inventions of the previous century. Being able to convey information to any other computer in the network is a powerful ability. Note that having computers as the nodes is what truly distinguishes them from other kinds of communication networks: every other kind of network requires a human at some point, either to transmit the information further (the postal system, telegrams) or to receive it (telephones). A computer can act as a retransmitter, usually in the form of a router, and it can not only receive information but also store it, so that the human user can retrieve it whenever they wish.

I was recently forced to study networks, using the book "Computer Networks" by Andrew S. Tanenbaum (an excellent starting-level book on networks, by the way). Although I am far from an expert in networks, discussions with fellow students with network expertise convinced me that this book covers almost all known aspects of networking, even if at an introductory level. So I consider my network adventures a kind of BFS and, frankly, I do not wish to go deeper. If you are still wondering how someone can be "forced" to study, it is possible if you combine a vast amount of exam reading material with the flu. I am sure most of you are familiar with spending time on a task that is irrelevant to your own goals but mandatory nonetheless.

I was not a stranger to computer networks, but this time there was a different feeling. Unintentionally, I was applying my computational complexity background to this rediscovered knowledge. Of course, I already knew some of the clashes between the two areas. For example, I am very well aware that big-O notation is not tolerated in networks: constants can make a great difference in the performance of a routing or collision management algorithm. This was not news to me, and it hardly surprised me as I read chapter after chapter, the exam clock ticking away. However, another issue popped up.

Another, much more important goal of mine is my diploma project, which is necessary for my graduation. It involves the time hierarchy theorem, so I am reading one of the foundations of computational complexity, the 1965 paper by Hartmanis and Stearns, "On the computational complexity of algorithms". A very important result of the paper is the following:

 Given any computable problem, there is a function T(n) that upper bounds its complexity.

There is perhaps an even stronger statement to be added: that this function is at most the running time of the problem's brute-force algorithm.

Even if the second statement is not true, this is a very important result. Given any computable problem, we know that there is a maximum effort we have to make. For many problems, we have algorithms that allow us to reduce that effort by a lot. For other problems, we do not know whether such algorithms exist, but we suspect that brute force is essentially the best we can do; that is one way to view the P vs NP question.
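
As a concrete illustration (my own example, not one from the paper): for Boolean satisfiability, merely trying every assignment already gives such an upper bound.

```latex
% Brute force over all 2^n truth assignments of an n-variable formula,
% checking each assignment in polynomial time, yields an explicit bound:
\[
  T(n) = O\!\left(2^{n} \cdot \mathrm{poly}(n)\right)
\]
% The P vs NP question asks, in effect, whether we can do exponentially
% better than this brute-force bound for SAT and its relatives.
```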

And that is what surprised me with networks. In many problems, unnecessary effort has been spent. First of all, there are way too many protocols, or to simplify, too many algorithms, for exactly the same problem. I am trying to be as reasonable as possible here. For example, wireless transmission needs different algorithms than wired transmission, simply because different issues arise. The same can be said for point-to-point communication versus broadcasting. But here we are talking about different approaches to the same problem with no specific advantages. Those of you who have read the book know that Tanenbaum admits this and blames political and business agendas for it, and I wholly agree with him. I feel that a handful of protocols could solve our networking problems, even accounting for genuinely different approaches like connection-oriented versus connectionless. I believe it is too late to reverse this trend: this beast of protocols is uncontrollable, and everyone has their say in how computers should communicate.

Even without the ocean of protocols, I still believe there is unnecessary complexity in networking today. The following example bugged me: my home computer uses the same Ethernet protocol that routers in a Google data center do. Using different algorithms for different input sizes is not uncommon; circuit complexity is actually about using a different algorithm (circuit) for every input size. I am talking about something more modest: having a bound on the input size that decides which algorithm to use. Sorting is a common example: library sort routines often switch to a simple method such as insertion sort when the input is small (say, a few dozen elements) and use quicksort or mergesort otherwise. Is it too much to ask that a connection from one computer to a home router use a different protocol than a large-scale enterprise LAN? Note that this can be part of the same protocol and does not require two different protocols, while it allows for better performance.
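
Here is a small Python sketch of the kind of size-based switch I have in mind: a single sorting routine that falls back to a simpler method below a chosen threshold, because the simpler method has better constants on tiny inputs. The threshold value is purely illustrative, not tuned:

```python
SMALL_INPUT_THRESHOLD = 32  # illustrative cutoff, not a tuned value


def insertion_sort(items):
    """Quadratic sort with tiny overhead; fast in practice on small inputs."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


def merge_sort(items):
    """Standard O(n log n) mergesort, better suited to large inputs."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def hybrid_sort(items):
    """One entry point that picks an algorithm based on input size."""
    if len(items) < SMALL_INPUT_THRESHOLD:
        return insertion_sort(items)
    return merge_sort(items)


print(hybrid_sort([5, 2, 9, 1]))  # small input: handled by insertion sort
```

The analogy to networking is loose, of course, but the idea is the same: a single protocol could detect the scale it is operating at and pick lighter or heavier machinery accordingly.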

There are, of course, other aspects of real life that I consider full of unnecessary complexity, although not always in the sense of computational complexity but more like system complexity. You can read more about complexity and its different aspects on Wikipedia for the time being: http://en.wikipedia.org/wiki/Complexity . Perhaps I will write a post on this subject. I believe that state bureaucracy and legal procedures are fine examples of unnecessary complexity, although it is easy to fall into the trap of using inefficient procedures for genuinely difficult problems. As with most problems, there is a fine balance to maintain, and it is not at all easy to do so.

I hope you enjoyed this post and found it thought-provoking. Do you know of any examples of unnecessary complexity? Do you believe it is possible to reverse a trend towards a multitude of solutions, and if so, do you have a specific approach in mind?

PS: I hope to make my next post after a shorter interval; the start of the spring semester did not allow me to blog as often as I would like.

Friday, February 18, 2011

Hello world!

Hello and welcome to this new blog. I hope you will enjoy reading it as much as I will enjoy writing it. So, let me make the necessary introductions. Today I want to talk about why I felt the need to start this blog, from both my perspective and the community's.

First off, a little bit about me. I am a soon-to-graduate undergraduate at the Computer Science department of the Aristotle University of Thessaloniki. While the majority of my fellow classmates hope to work as software engineers, I have different plans. After my graduation, any party interested in the full details of my degree will see that I specialised in Information Systems. However, that is only half the truth. The other half is that in my spare time I have been working in TCS, especially in complexity theory. Thankfully, I was able to take on a diploma project in complexity theory, which I hope will reflect my studies of theory.

Anyway, enough about me. As the title suggests, this blog is about Theoretical Computer Science, or TCS. Although I guess it would be most interesting to the "theory people", you do not need a degree to follow this blog. If I wanted to write about TCS in a strict way, I would write a book. Instead, I believe there are three main directions for this blog's content:
  1. Comment on non-CS topics, like a certain aspect of everyday life or a story that unfolded recently, and explore how it relates to TCS, whether we can learn from it, and whether TCS could be applied in the given scenario. I believe this would be most interesting to people who do not know much about theory. Examples could be: "Zero-knowledge proofs and the Cold War" or "Recent disaster: could algorithm A help the allocation of humanitarian aid?"
  2. Stop by other areas of CS and take a look from the theory perspective. Preferably, these areas would be fairly large, to be interesting enough, and the discussion will stay skin-deep. I do not want to talk about something few people know about or go into greater depth than is needed. The goal would be to see how an area benefits from theory and vice versa, and to explore whether there are any conflicting views. Example posts could be "What would P=NP mean for artificial intelligence?" or "Is there any place for theory in software engineering?"
  3. Talk about unexpected results and insightful techniques used in TCS. Again, these would resemble informal discussions in a conference corridor more than a written survey. Topics could include: What was the driving force behind the researchers who made a discovery? How can this result help solve other open problems and expand our knowledge? Can this technique be applied to a given problem? I expect these kinds of posts to be the most popular among people from the TCS community.
In my work, I am mostly interested in computational complexity, and I guess that as time goes by this blog may become more complexity-oriented. Perhaps not; only time will tell. Either way, this blog is built on the assumption that its readers know nothing more than a freshman does. Unknown concepts will be explained clearly but not exhaustively. The purpose is to spark the reader's interest; if a reader wishes to learn more, there are other, far better resources. Mathematics and theorems will be kept on a "need-to-know" basis. That means that, when needed, proof sketches will be given rather than full proofs. Again, a book or a paper will provide a much better full proof than I will.

When I was picking out a name for this blog, I thought of YATB: Yet Another Theory Blog. There are many blogs out there. So what was my motivation for this one? Two reasons convinced me to start this project:


1. There are many theory blogs, some of them very popular in the TCS community. I follow a good number regularly, and it has been very beneficial to me. However, all of them have something in common: the people who write them were already known in the community. Some of them are tenured and have brilliant careers to share; others are in tenure-track positions with promising careers ahead and strong publications. Why someone would read their blogs needs no explanation: they are leaders of our community. So what about me? I have attended a single TCS conference, FOCS 2010. I have zero publications (but not for long), although I have authored a number of unpublished technical reports.
So why bother making a new blog? Because I believe I have something unique to offer: the journey towards becoming a TCS researcher and, hopefully, a good one. If other blogs are pictures of people at the peak of Mount Everest, I want to make a documentary about climbing the whole thing, from the bottom to whatever height my strength and the circumstances take me. It sounds interesting to me, no?

2. Most of the experienced blog authors have already specialized. I have not, and I aim to write about CS in general, even if from a theory perspective. Sure, complexity might appear more frequently than other topics, mostly because I feel safer writing about it and it will probably take less time, but I aim to cover as much of CS as possible. I have already decided what my first two posts will be: networks and robotics, respectively.
Another issue is that some blogs are aimed at a more experienced audience. It's not uncommon, when I read some TCS blogs, for me to start writing on my whiteboard as if I were attending a lecture. Some blogs are written with this purpose. However, I prefer giving the basics of the subject at hand so that everyone can grasp its essence. Those interested can go on to look for more resources, perhaps some that I will link to.

Before I finish this first post, there is a final issue. I expect a lot of this blog's readers to be Greek. Being Greek myself, perhaps writing in Greek would be easier for them and for me. However, I don't want to exclude anyone from this blog. Given that most Greeks likely to read a blog like this are comfortable with English, I'd rather tire myself a bit more and provide a version that is readable by as many people as possible.

Now that the foundations of our communication have been laid, we can move on to the good stuff. See you in a few days, when I will share my thoughts on computer networks.