WLang (Part I)

As I said in the abstract of this paper, each time you write a simple HTML page (or a SQL query) by hand, you are actually creating a kind of program that will be interpreted by a browser (or by a database engine). If the page or the query is created by string concatenation, you are actually generating code. Generating code is far from trivial: you have to respect many good practices and conventions in order to build correct and secure programs. If you know perfectly what we mean by proper value encoding, query structure preservation, backquoting, and the like, as well as the dramatic consequences of not applying them rigorously, then you are ready to enjoy wlang in this domain and are invited to read the second part of this article. For everyone else, below is a short introduction to the problem that wlang solves. We mainly focus on security problems here, but wlang helps solve code generation problems in a broader sense.

Hello World in PHP

First of all: you can try this example here.

Assume you've written a helloworld.php that invites the user to enter their name and simply says hello when the form is submitted:

<html>
  <head><title>Hello world in PHP</title></head>
  <body>
    <h1>Hello <?= $_POST['name'] ?></h1>
    <form action="helloworld.php" method="POST">
      <input type="text" name="name"/>
      <input type="submit" value="Submit"/>
    </form>
  </body>
</html>

What does PHP do when it executes (or interprets) such a page? (Such a page is commonly called a server page, and similar examples can be found in JSP, ASP, Ruby on Rails, etc.) In the example at hand, it replaces the whole block <?= $_POST['name'] ?> with what the user has entered in the form field called 'name' (which is the result of executing the PHP code $_POST['name']). The resulting page is then sent to the browser, which interprets it, and this interpretation leads to the rendering of the page on your display.

In fact, each time you submit the form, PHP generates a new page saying hello to the name you've entered; this page is a kind of program that must be interpreted by the browser for correct rendering. More generally, each time PHP creates a page in response to a request by inserting computed values (which come from a form, a database, a web service, etc.) inside plain text, it generates code...

  1. Question: what is the intended behavior (or even, the specification) of the HTML page generated by PHP using the server page shown above?
  2. Expected answer: to display 'Hello ' followed by what the user has entered as name, inside an <h1> tag.
  3. Experiment: put <script>alert('You are a fu..... m..... f..... !!')</script> instead of your name in the form.
  4. Observed result: your browser insults you and only 'Hello ' appears in the page.

If this behavior is not what is actually expected, we can say that the program does not respect its specification. And the program we are talking about is the one generated by PHP. Your server page is another program, whose specification is (with the help of PHP) to generate programs whose own specification is to display 'Hello, ...'. Your server page does not respect its specification either, since it can generate programs that do not respect theirs. As already said, generating code is far from trivial: a program that looks correct sometimes is not!
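The experiment can be reproduced outside PHP. The sketch below uses Ruby's standard ERB library as a stand-in for the server page; the helper name render_hello is an assumption made for illustration, and the template mirrors only the <h1> line of helloworld.php.

```ruby
require "erb"

# Hypothetical stand-in for helloworld.php: the submitted name is inserted
# into the page as-is, without any encoding (plain ERB does not escape).
def render_hello(name)
  ERB.new("<h1>Hello <%= name %></h1>").result_with_hash(name: name)
end

# A well-behaved user: the generated page matches its specification.
puts render_hello("Alice")

# A hostile user: the generated page now contains a real <script> element.
puts render_hello("<script>alert('gotcha')</script>")
```

The second generated page is a different program from the one the author had in mind: the browser will execute the script instead of displaying it.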

You probably know the solution to the problem here: wrapping $_POST['name'] in an htmlentities function call. What is true here is also true for 95% of the inclusions of that kind: each time you inject untrusted values into a dynamically generated HTML page, you have to invoke an entities encoder, unless you allow the injection to change the structure of the page itself (that is, the injected value itself contains real tags of the generated page).
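The same fix can be sketched in Ruby, where ERB::Util.html_escape from the standard library plays the role of an entities encoder. The helper name safe_hello is an assumption made for illustration.

```ruby
require "erb"

# Hypothetical helper: builds the greeting with the untrusted value passed
# through an entities encoder before it is injected into the page.
def safe_hello(name)
  "<h1>Hello #{ERB::Util.html_escape(name)}</h1>"
end

# The markup characters are now encoded, so the browser displays the text
# instead of interpreting it as a script element.
puts safe_hello("<script>alert('gotcha')</script>")
```

With the encoder in place, the injected value can no longer change the structure of the generated page.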

Forgetting to encode data properly, as in the example, leads to the well-known XSS attack (XSS stands for Cross-Site Scripting), which can have much more dramatic consequences than insulting you (or worse, your client). Using a templating engine is not enough: few of them perform automatic encoding, and those that do often fall into the kind of trivial solutions which will be described later. Moreover, let me insist on one fact: this problem is not specific to PHP, and you can introduce security holes of this kind with almost all of the best web frameworks; also, AJAX-based technologies do not solve the problem!

As encoding is left to developers, the dramatic consequences of such attacks are due to developers' errors. At least, this seems to be the conclusion of the recent "Experts Announce Agreement on the 25 Most Dangerous Programming Errors".

When SQL's select performs a delete

First of all: if you want to learn more, read the Wikipedia entry on SQL injection attacks.

Assume this time that your application allows a user to display the list of their recent purchases. For this, many developers build SQL queries by string concatenation. The Java code excerpt below provides a typical example:

String buyerName = ...;                 // some buyer name received previously
Connection c = ...;                     // get a JDBC connection
String sql = "SELECT * FROM buying " +
             "WHERE buyerName = " +
             "'" + buyerName + "'";     // unsafe: the value is injected as-is
Statement st = c.createStatement();     // create a query statement
ResultSet rs = st.executeQuery(sql);    // execute the query and get the result
...                                     // display results to the user

As before, this program must be considered incorrect. If my name is O'Neil, for example, it fails with a SQLException. Indeed, the created query will be the one below, which is not syntactically valid: the quote between O and Neil confuses the query parser of the database engine, which raises an error.

SELECT * FROM buying WHERE buyerName = 'O'Neil'

Assume now that I have write access to the buyerName variable, that is, that I can choose its value (because it comes from a GUI or an HTML form, or appears as a parameter in the query string of a web application, ...). More dangerous than before, nothing prevents me from choosing a value like this:

'; DELETE FROM buying WHERE buyerName='concurrent

With a driver that accepts several statements in one call, the result will be the execution of the following queries, with obviously dramatic consequences.

SELECT * FROM buying WHERE buyerName = ''; 
DELETE FROM buying WHERE buyerName='concurrent'

What does this tell us? That building a query by concatenation of strings is generating code (here, code that will be executed by a database engine). And, once again, that writing a program that generates correct and secure code is not an easy task: a program that looks correct sometimes is not!

You probably know the solution to this problem as well. There are two solutions, in fact. The first consists in invoking utility functions to encode values coming from the user before injecting them between quotes in a SQL query. The kind of encoding may depend on the database engine and, for example, takes care of backslashing quotes (what is called backquoting). O'Neil will become O\'Neil, and the backslashed quote will not confuse the query parser at all! The same is true for the second example: the first quote of my dangerous value will be backslashed, which prevents me from injecting real SQL code as before.
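A minimal sketch of such a utility function, in Ruby, is given below. The name sql_quote and its style parameter are assumptions made for illustration, not the API of any real driver; in practice the quoting function shipped with your database driver should be preferred, since the right encoding depends on the engine.

```ruby
# Hypothetical helper, not a real driver API: turns a string into a quoted
# SQL string literal.
#   :standard  doubles single quotes (standard SQL style)
#   :backslash puts a backslash before quotes and backslashes (backquoting)
def sql_quote(value, style = :standard)
  escaped =
    case style
    when :standard  then value.gsub("'", "''")
    when :backslash then value.gsub(/[\\']/) { |ch| "\\#{ch}" }
    else raise ArgumentError, "unknown quoting style: #{style}"
    end
  "'#{escaped}'"
end

# The query is still built by concatenation, but the value is encoded first.
def purchases_query(buyer_name, style = :standard)
  "SELECT * FROM buying WHERE buyerName = " + sql_quote(buyer_name, style)
end

puts purchases_query("O'Neil", :backslash)
puts purchases_query("'; DELETE FROM buying WHERE buyerName='concurrent")
```

In both cases the whole user value stays inside a single string literal, so the structure of the query is preserved.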

The second way is probably considered the best practice, and consists in creating what are called prepared statements, as in the following example:

String buyerName = ...;
Connection c = ...;
String sql = "SELECT * FROM buying WHERE buyerName = ?";
PreparedStatement st = c.prepareStatement(sql);
st.setString(1, buyerName);             // bind the value to the first placeholder
ResultSet rs = st.executeQuery();

Creating queries this way also solves the problem: question marks are replaced by values at query execution time; this replacement may be seen as taking care of encoding values properly, based on their type (so that strings will first be backquoted and then enclosed in single quotes, for example).
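This way of seeing placeholders can be sketched as a toy model in Ruby. It is a conceptual illustration only: real drivers usually send the parameters to the database separately instead of rewriting the query text, and the names encode_param and bind are assumptions made for this sketch.

```ruby
# Toy model: encodes one parameter according to its Ruby type.
def encode_param(value)
  case value
  when nil            then "NULL"
  when Integer, Float then value.to_s
  when String         then "'" + value.gsub("'", "''") + "'"
  else raise ArgumentError, "unsupported parameter type: #{value.class}"
  end
end

# Toy model: replaces each question mark, in order, by its encoded parameter.
# Naive on purpose: a question mark inside a SQL string literal would also be
# treated as a placeholder.
def bind(sql, params)
  remaining = params.dup
  bound = sql.gsub("?") do
    raise ArgumentError, "too few parameters" if remaining.empty?
    encode_param(remaining.shift)
  end
  raise ArgumentError, "too many parameters" unless remaining.empty?
  bound
end

puts bind("SELECT * FROM buying WHERE buyerName = ?", ["O'Neil"])
puts bind("SELECT * FROM buying WHERE price > ? AND coupon = ?", [10, nil])
```

The point of the model is that the encoding decision is taken from the type of each value, not left to whoever writes the query string.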

Forgetting to encode data properly when creating SQL queries by string concatenation leads to the well-known security attacks called "SQL injection attacks". Moreover, let me insist on one fact: this problem is not specific to Java, and you can introduce security holes of this kind with almost all of the best languages. For example, in Ruby, the (much more developer-friendly) string below suffers from the same kind of problem (Ruby automatically replaces the #{buyerName} part of this string with the result of buyerName.to_s):

buyerName = ...
sql = "SELECT * FROM buying WHERE buyerName='#{buyerName}'"
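As a quick check, the interpolation can be wrapped in a small helper and fed the two values used earlier. The helper name purchases_query is an assumption made for illustration; the query string is the one above.

```ruby
# Same construction as above, wrapped in a hypothetical helper so that it can
# be called with different buyer names.
def purchases_query(buyerName)
  "SELECT * FROM buying WHERE buyerName='#{buyerName}'"
end

# O'Neil: the generated query is syntactically invalid.
puts purchases_query("O'Neil")

# Hostile value: the generated text now contains a second statement.
puts purchases_query("'; DELETE FROM buying WHERE buyerName='concurrent")
```

Nothing in the interpolation knows that the value is meant to stay inside a SQL string literal, so both problems of the Java version reappear unchanged.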

As backquoting is left to developers, the dramatic consequences of such attacks are due to developers' errors. At least, this seems to be the conclusion of the recent "Experts Announce Agreement on the 25 Most Dangerous Programming Errors".

Where are we moving now?

This first article about wlang was an introduction to the problem it solves. Even if we have mainly focused on security problems, the problem is more general: building programs that generate code is a hard task. Indeed, each time you inject a value as part of the generated source, you have to respect the syntactic and semantic rules that hold where the value is injected. Those rules are numerous and complex:

  1. they depend on the target language (i.e. HTML, SQL, Java, Ruby, ...)
  2. they may depend on a dialect: backquoting in MySQL vs. doublequoting in Sybase for example.
  3. they depend on the place where you inject the value: single-quoted strings vs. double-quoted strings for example.
  4. they depend on the injection semantics you want: does the injection participate in the HTML tree structure or not?
  5. they can change dynamically during generation: what if you generate an HTML page that also contains generated JavaScript or CSS? Or you could generate Ruby code that embeds generated SQL queries, ...
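To make the first and third rules concrete, the sketch below encodes the same value for three different injection contexts using Ruby's standard library. The function encode_for and its context names are assumptions made for illustration; the SQL case uses standard quote doubling, which is only one of the possible dialects.

```ruby
require "erb"
require "uri"

# Hypothetical dispatcher: the right encoding depends on where the value is
# going to be injected, not on the value itself.
def encode_for(context, value)
  case context
  when :html_text  then ERB::Util.html_escape(value)           # inside an HTML element
  when :url_param  then URI.encode_www_form_component(value)   # inside a query string
  when :sql_string then "'" + value.gsub("'", "''") + "'"      # as a SQL string literal
  else raise ArgumentError, "unknown context: #{context}"
  end
end

value = "Tom & Jerry's"
[:html_text, :url_param, :sql_string].each do |context|
  puts "#{context}: #{encode_for(context, value)}"
end
```

One value, three contexts, three different results: picking the encoder that belongs to another context is exactly the kind of mistake described in this article.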

I agree with the experts: developers make mistakes. But I also claim that there is a lack of tool support for such tasks, even the simplest ones like generating an HTML page ... WLang is an attempt to provide such tool support: it is sufficiently abstract to have implementations in different languages, and sufficiently powerful to provide a robust and elegant solution (at least, I hope so!). The second part of this paper introduces its foundations; the third one builds on these foundations and shows how a powerful HTML templating engine can be created using wlang.