I have relocated the blog to my own custom server.
From now on, you will be able to follow this blog at:
http://www.gfilter.net
Tuesday, January 27, 2009
Sunday, January 11, 2009
Graphing Reddit Language Preferences
Programming Reddit recently had a survey of "favorite" languages with some surprising results.
I took the time to enter each "positive mention" of a language into a Google Spreadsheet (I also wanted to play with that; I'm mildly impressed, though its charting ability is weak at best).
Results:

An interesting trend has developed here. I found that the most maligned languages on Reddit had surprisingly high showings in this thread. C#, Java, C++, and Perl all ranked much higher than the more esoteric languages that are often advocated there.
Python's 2nd place finish is no surprise, and it is well deserved, but I was very surprised by the large showing of C# developers. They seemed to come out of the woodwork. The usual anti-C# trolls made no showing at all, further convincing me that they aren't actually serious programmers anyway.
When you remove all of the languages that received fewer than 3 positive mentions, it is easy to see that the massive majority of respondents in this thread prefer C#, Python, and C.

Not the often praised Haskell, Erlang, Scala or Lisp, but regular "get work done" languages. This is a great thing, and shows a resurgence of actual developers on Reddit, and not head-in-the-clouds academics.
Edit: The above sentence appears to have been widely misinterpreted, so I thought I'd clarify.
I'm happy to see other people with interests similar to mine, people who actually write the software the world uses, as opposed to the people who invent the tools necessary to write the software. Huge respect to academics, but as a software developer, and not a computer scientist, I am happy to find others of a like mind.
In my world, I don't have 15 years to invent a compiler; I have 6 months to get from business case to shipped. I sometimes consider academics out of touch with the realities of commercial business, hence the "head-in-the-clouds" statement.
Saturday, July 26, 2008
Benchmarking the CLR vs the JVM
I've found that there are lots of flame wars regarding JVM/CLR performance, but very little solid data, so I thought I'd try running a benchmark of my own.
I decided the first benchmark would be sorting using a bubble sort. The bubble sort is highly inefficient, so it spends a long time on array access and integer arithmetic, which makes it a reasonable stress test for those operations.
I also wanted to see how well the JITs would handle method-by-method optimization, so for each runtime I wrote one version that used methods and one version that was entirely inline.
My theory was that the inline version would prove to be slower than the method-invoking version due to JIT optimizations when calling methods. In practice, the difference was even more pronounced than I expected. However, when I disabled the JIT optimizations on the CLR by using a debug build, the inline CLR version was much faster than the method version.

This data was generated on my 2.4 GHz laptop running Windows XP. I used the .NET 2.0 CLR and JRE 1.6.0.5 (which just happened to be the one that came with Eclipse). The Y axis is milliseconds.
The benchmark performs a bubble sort on a set of 10,000 integers, 100 times.
It shows that the CLR is a slight winner, at least in doing these operations. The big surprise was the speed difference the JVM JIT makes when optimizing methods.
Even the unoptimized CLR version was only slightly slower than the JVM.
The surprising lesson learned here is that premature optimization by inlining methods and unrolling loops may actually result in slower performance on the JVM.
If anyone wants to run these benchmarks on their machine, the full source is here.
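For readers who just want the shape of the two variants, here is a rough sketch. It is not the linked source: it is written in Python rather than C# or Java, the helper `swap` and the harness `time_sort` are names I made up, and I am assuming the "method" version factors out the swap. Its timings say nothing about either JIT; it only shows what "method version" versus "inline version" means.

```python
import random
import time


def swap(values, i, j):
    """Helper used only by the method-calling variant (assumed structure)."""
    values[i], values[j] = values[j], values[i]


def bubble_sort_methods(values):
    """Bubble sort that delegates every swap to a helper function."""
    for end in range(len(values) - 1, 0, -1):
        for i in range(end):
            if values[i] > values[i + 1]:
                swap(values, i, i + 1)


def bubble_sort_inline(values):
    """Bubble sort with the swap written out inline, no helper calls."""
    for end in range(len(values) - 1, 0, -1):
        for i in range(end):
            if values[i] > values[i + 1]:
                tmp = values[i]
                values[i] = values[i + 1]
                values[i + 1] = tmp


def time_sort(sort_fn, size, runs, seed=0):
    """Return total milliseconds spent sorting `runs` random lists of `size`."""
    rng = random.Random(seed)
    elapsed = 0.0
    for _ in range(runs):
        data = [rng.randrange(size) for _ in range(size)]
        start = time.perf_counter()
        sort_fn(data)
        elapsed += time.perf_counter() - start
    return elapsed * 1000.0


if __name__ == "__main__":
    # The post used 10,000 integers and 100 runs; much smaller values are
    # used here so the sketch finishes quickly in an interpreter.
    for name, fn in [("methods", bubble_sort_methods),
                     ("inline", bubble_sort_inline)]:
        print(name, round(time_sort(fn, size=300, runs=3), 1), "ms")
```

Both variants get the same seed, so they sort identical inputs and only the call structure differs.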
Tuesday, July 15, 2008
Normalize First
Sometimes Jeff Atwood does more harm than good. I wonder how many people will misunderstand his latest post.
I'm not positive that even he understands his latest post.
Take a look at his example query:
select * from Users u
inner join UserPhoneNumbers upn
on u.user_id = upn.user_id
inner join UserScreenNames usn
on u.user_id = usn.user_id
inner join UserAffiliations ua
on u.user_id = ua.user_id
inner join Affiliations a
on a.affiliation_id = ua.affiliation_id
inner join UserWorkHistory uwh
on u.user_id = uwh.user_id
inner join Affiliations wa
on uwh.affiliation_id = wa.affiliation_id
He argues that using 6 joins to pull together data on users is excessive, while missing the real point: 3NF (then again, his "normalized schema" isn't even 2NF) is there for data integrity and ease of maintenance, not performance.
(Also, his join returns a boatload of duplicate data)
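To see the duplicate-data problem in miniature, here is a small sketch using Python's built-in sqlite3 module. The table names and `user_id` come from the query above; the `name`, `phone_number` and `screen_name` columns and the sample rows are made up for illustration. Joining two independent one-to-many tables in a single query multiplies the rows per user.

```python
import sqlite3


def build_demo_db():
    """In-memory database: one user with 2 phone numbers and 3 screen names.

    Column names other than user_id are assumptions for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Users (user_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE UserPhoneNumbers (
            user_id INTEGER REFERENCES Users(user_id), phone_number TEXT);
        CREATE TABLE UserScreenNames (
            user_id INTEGER REFERENCES Users(user_id), screen_name TEXT);
    """)
    conn.execute("INSERT INTO Users VALUES (1, 'alice')")
    conn.executemany("INSERT INTO UserPhoneNumbers VALUES (1, ?)",
                     [("555-0100",), ("555-0101",)])
    conn.executemany("INSERT INTO UserScreenNames VALUES (1, ?)",
                     [("alice_im",), ("alice_irc",), ("alice_web",)])
    return conn


def joined_rows(conn):
    """Join both child tables at once, as the example query does."""
    return conn.execute("""
        SELECT u.name, upn.phone_number, usn.screen_name
        FROM Users u
        INNER JOIN UserPhoneNumbers upn ON u.user_id = upn.user_id
        INNER JOIN UserScreenNames usn ON u.user_id = usn.user_id
    """).fetchall()


if __name__ == "__main__":
    rows = joined_rows(build_demo_db())
    # 2 phone numbers x 3 screen names = 6 rows for a single user
    print(len(rows))
```

That is a problem with how the query is written, not with the normalized tables: each child table on its own still holds every fact exactly once.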
Normalize first. Denormalize only when it's the last resort.
The scary thing is that people are going to read Atwood and then decide that their 200-column mega table is a good idea.
Thursday, April 24, 2008
It's always Windows' fault!
Jeff Atwood has been providing me with a lot of blogging material lately. I want to revisit his article that I talked about yesterday to study something that amused me greatly.
The common knee-jerk reaction to Jeff's abysmal PHP/WordPress statistics was that it was Windows' fault, because he was running the site on a virtualized Server 2008 using IIS7.
As everyone knows, IIS is a horrible web server, the worst ever made, and Windows Server is really just a desktop OS with a fancy name and a higher price tag... right?
This sort of knee-jerk reaction is dangerous, because not only is it incorrect (how would the platform affect the number of queries being performed?), but it also makes the commenter look, to put it nicely... stupid.
But I was curious. I wanted to see how fast PHP was on IIS7, and I wanted to see some benchmarks of MySQL running on Windows versus MySQL running on Linux.
However, I couldn't find any good benchmark results on MySQL. I found some interesting comparisons of MySQL versus SQL Server running on Windows (guess who wins by a mile?), but it seems like nobody wants to run benchmarks between platforms. If you have or know of a benchmark, post it.
As far as PHP's performance on IIS7 goes, it is actually very fast when configured correctly.
I'm lifting some benchmarks from Bill's IIS blog.
Here is PHP running through ISAPI, which means that IIS will multithread it like it does ASP.NET. This can be somewhat unstable because a lot of PHP code is not written to be run on multiple threads. This is pretty much how PHP runs on IIS6.

You can see that on Bill's Vista laptop, he is getting about 60 requests per second. Not bad for a dev box, considering he is running the client software and the server on the same machine. I'd say it's safe to double every number he produces for an "in the wild" benchmark.
However, IIS7 now supports FastCGI, so let's rerun that benchmark using FastCGI. This will also be more stable, because each FastCGI PHP process handles its requests on a single thread.

With FastCGI set up, Bill's little dev laptop is now pushing over 100 RPS. Not bad at all (again considering he is running the client software on the same CPU as the server), but we can do a LOT better.
IIS7 has kernel-level caching abilities, so let's go ahead and enable that:

Notice that the scale of the graph has changed. This little laptop is now handling 6,000 RPS. I'm a little impressed with IIS7, and you should be too :)
So we can rule out PHP on Windows as the slow culprit, and we are on the fence about whether MySQL is slower on Windows. I'm waiting for proof on that.
All this said, it will do nothing to prevent the knee-jerk nonsense that is so prevalent in online communities. You can fight ignorance all you want, but I guess you can't fix stupid.
Wednesday, April 23, 2008
Why Relational Databases end up being the bottleneck
These days it is all the rage to diss on RDBMS. Just surf over to Digg or Reddit and you'll see lots of the following:
- They don't scale well
- There are object/relational impedance mismatch problems
- They use massive amounts of CPU time
- Performing table joins is slow
The list could go on and on. All of the cool kids are hyping new non-relational databases such as SimpleDB and CouchDB, or even Google's BigTable.
I'm not in that camp. I do understand the shortcomings of a relational database, but I also know its strengths.
Personally, I don't use MySQL; I use SQL Server for my projects, and I have yet to have a scaling issue with it, so I didn't understand why people always claim that the database is the bottleneck in their application.
However, I finally understood today, because Jeff Atwood blogged about WordPress (one of the flagship LAMP products).
WordPress in its unpatched and default state performs 20 queries to retrieve the posts on the front page.
In comparison, Codeblog performs 4, and I was unhappy with that, but that was due to trying SubSonic out instead of hand rolling my own database layer.
So I gave WordPress the benefit of the doubt: those 20 queries must be important, right?
Wrong again. Take this example from WordPress:
SELECT SQL_CALC_FOUND_ROWS wp_posts.*
FROM wp_posts
WHERE 1=1
AND wp_posts.post_type = 'post'
AND (wp_posts.post_status = 'publish')
ORDER BY wp_posts.post_date DESC LIMIT 0, 10
And then:
SELECT FOUND_ROWS()
And then:
SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE
post_id IN (3,1) ORDER BY post_id, meta_key
Can you count the WTFs there? Did the authors have no clue about normalization and writing efficient joins? There are no relationships between posts, comments and authors; instead they just keep mindlessly querying away.
Of course, the saddest part is that the solution is not to fix their database, it's to install a caching plugin instead to hide the fact that their database and supporting code SUCKS.
The bottom line is don't tell me RDBMS can't scale if you can't write a decent query or design a normalized database schema.
Unless you are performing a complex 10-table join, it will still be cheaper than opening 10 database connections and then sorting it all out in your code.
Relational databases are not the bottleneck. Crappy programmers are, which is what I suspected all along.
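The join-versus-many-queries point can be sketched with Python's built-in sqlite3 module. This is a toy, not WordPress code: the schema is cut down to the columns that appear in the queries above plus the `ID` key, the sample rows are invented, and the function names are mine. It does not measure speed; it only shows that one join returns the same data as a query per post, with fewer round trips.

```python
import sqlite3

# Shared filter for "the latest published posts", modelled on the query above.
LATEST_POSTS = """
    SELECT ID, post_date FROM wp_posts
    WHERE post_type = 'post' AND post_status = 'publish'
    ORDER BY post_date DESC LIMIT 10
"""


def make_db():
    """Tiny in-memory stand-in for wp_posts / wp_postmeta (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_type TEXT,
                               post_status TEXT, post_date TEXT);
        CREATE TABLE wp_postmeta (post_id INTEGER REFERENCES wp_posts(ID),
                                  meta_key TEXT, meta_value TEXT);
    """)
    conn.executemany("INSERT INTO wp_posts VALUES (?, 'post', 'publish', ?)",
                     [(1, "2008-04-01"), (3, "2008-04-20")])
    conn.executemany("INSERT INTO wp_postmeta VALUES (?, ?, ?)",
                     [(1, "views", "10"), (3, "views", "25"),
                      (3, "mood", "grumpy")])
    return conn


def meta_query_per_post(conn):
    """One query for the posts, then one more query for every post."""
    queries = 1
    result = {}
    for post_id, _date in conn.execute(LATEST_POSTS).fetchall():
        rows = conn.execute(
            "SELECT meta_key, meta_value FROM wp_postmeta WHERE post_id = ?",
            (post_id,)).fetchall()
        queries += 1
        result[post_id] = dict(rows)
    return result, queries


def meta_single_join(conn):
    """The same data fetched with a single LEFT JOIN."""
    rows = conn.execute(
        "SELECT p.ID, m.meta_key, m.meta_value FROM (" + LATEST_POSTS + ") p "
        "LEFT JOIN wp_postmeta m ON m.post_id = p.ID").fetchall()
    result = {}
    for post_id, key, value in rows:
        meta = result.setdefault(post_id, {})
        if key is not None:  # posts without metadata still get an entry
            meta[key] = value
    return result, 1


if __name__ == "__main__":
    conn = make_db()
    print(meta_query_per_post(conn))
    print(meta_single_join(conn))
```

With two posts the per-post version issues 3 queries against 1 for the join, and the gap grows with every extra post on the page.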
Thursday, April 10, 2008
GFilter open beta
GFilter is a small project that I began this week, and I am now opening it up for beta. Please leave any bug reports as a comment here.
The concept behind GFilter is simple. GFilter allows you to create a personal blacklist of sites that you do not want to show up when you search the internet. This means no more Experts Exchange when you are trying to find a handy regular expression, etc.
Additionally, sites that are especially unpopular across users will be automatically removed from any search, so the collective force of GFilter users will sanitize the web.
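As a rough illustration of the idea (not GFilter's actual code), the two kinds of filtering described above might look something like this. Every name here is hypothetical, as is the rule that a site is dropped for everyone once at least `min_users` users have blacklisted it.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def shared_blacklist(user_blacklists, min_users):
    """Domains blacklisted by at least `min_users` users (assumed rule)."""
    counts = Counter(domain
                     for domains in user_blacklists.values()
                     for domain in set(domains))
    return {domain for domain, n in counts.items() if n >= min_users}


def filter_results(urls, personal, shared=frozenset()):
    """Drop results whose domain is on the personal or the shared blacklist."""
    blocked = set(personal) | set(shared)
    return [url for url in urls if domain_of(url) not in blocked]


if __name__ == "__main__":
    results = ["http://www.experts-exchange.com/q/1",
               "http://example.org/regex-howto",
               "http://spam.example.com/regex"]
    everyone = {"u1": ["spam.example.com"],
                "u2": ["spam.example.com", "example.net"]}
    shared = shared_blacklist(everyone, min_users=2)
    print(filter_results(results, {"experts-exchange.com"}, shared))
```

Matching on the whole host keeps the sketch simple; a real filter would probably also need to decide how to treat subdomains.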
Try it out: set it as your homepage for a few days and drop me a note with any defects or suggestions. This is an experimental project designed just for fun!
http://www.gfilter.net
As a side note, I desperately need a designer to come up with a good logo for this. The cliche web 2.0 logo works fine as a placeholder, but it needs to go now :)