June 30th, 2008

Similar Nodes Module Released!

I’ve just released a new module which I hope will fill a general need in the Drupal community.

This module allows you to show nodes in a view which have the same taxonomy as a node which you pass in as an argument. BUT it goes a step beyond this. By using a weighting table, one can specify the weight of each vocabulary in making the comparison.

Say you have two vocabularies on a movie site, “Genre” and “Community Tags”, the latter being a free tagging taxonomy. With this module, you can give Genre a weight of 10 and Community Tags a weight of 1. This means that if I’m looking at “Meet the Parents”, I’m most likely to see a list of similar Romantic Comedies like “Sleepless in Seattle” and “French Kiss”, with something like “Taxi” or “The Godfather II” coming lower in the list because someone tagged both with “Robert De Niro”.
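To make the weighting idea concrete, here is a rough sketch in Python. It is not the module's code; the function name `rank_similar`, the data layout and the tags on each movie are all made up for illustration. It simply scores every other node by the number of terms it shares with the target node, multiplied by the weight of the vocabulary those terms belong to.

```python
from collections import defaultdict


def rank_similar(target, nodes, vocab_weights):
    """Rank other nodes by the weighted number of terms shared with `target`."""
    scores = defaultdict(int)
    for title, terms in nodes.items():
        if title == target:
            continue
        for vocab, weight in vocab_weights.items():
            shared = terms.get(vocab, set()) & nodes[target].get(vocab, set())
            scores[title] += weight * len(shared)
    # Highest score first, ties broken by title; drop nodes that share nothing.
    return sorted(
        ((title, score) for title, score in scores.items() if score > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Illustrative data only: the term assignments are assumptions for the demo.
movies = {
    "Meet the Parents": {"Genre": {"Romantic Comedy"}, "Community Tags": {"Robert De Niro"}},
    "Sleepless in Seattle": {"Genre": {"Romantic Comedy"}, "Community Tags": set()},
    "French Kiss": {"Genre": {"Romantic Comedy"}, "Community Tags": set()},
    "Taxi": {"Genre": {"Action"}, "Community Tags": {"Robert De Niro"}},
}
weights = {"Genre": 10, "Community Tags": 1}

ranked = rank_similar("Meet the Parents", movies, weights)
for title, score in ranked:
    print(score, title)
```

With these weights, one shared Genre term outranks any single shared community tag, which is the behaviour described above.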

See the Module Page for more information and how to use it.

I’m going to be cleaning up the code a bit and providing a sample view soon, but any brave souls out there who would like to test / suggest, please go ahead!

June 30th, 2008

Solving bad IA using enterprise search (Vocabulary)

Long time no blog…

I had a bit of a realization (or rather a resurgence of a recurring realization) that I enjoy writing. It happened this weekend as I was “getting away from it all”.

I’ve been interested in enterprise search for small and medium enterprises for a while now. Having implemented the Google Mini and the GSA, I’ve seen how a good search can really turn your information architecture on its head in a good way. Like any conversation between two entities be they two people or a person and a website, communication is difficult, and many of the same rules apply:

Vocabulary

You have to speak the same language. This doesn’t mean a Thai can’t talk to a Nigerian, but it does mean that when you are communicating, if the same word doesn’t mean the same thing (which it never does), your intentions, delivery and content are worthless. That is why non-verbal communication and communication over phones are so ineffective compared to face to face meetings. The words may be the same, but the interpretation never is.

So what does this have to do with enterprise search? When a user wants something from your website, they are looking for a keyword. Many computer scientists have tried to make linguistically aware search engines which correctly interpret sentences and questions. Some of these results are useful, but generally, I believe people don’t come to a site thinking “Do you have any red Toyota Corollas?” Internally, they are simply thinking “Corolla” and “Red”. For instance, I could speak only two words of English, Red and Corolla, and chances are, I could walk into any American city and rent a red Toyota Corolla.

When you plan the information architecture for a site, you usually start with personas, or stereotypes of users, which have goals. Then you define the actions they take to meet those goals, and try to use the same vocabulary and thought process as these users to make an interface which is organized like their brain. But when you have 10 different personas, how is this possible? Within your 10 stereotypes there can be huge variation, and outside of your assumptions there may be other users you never thought about. With a good search engine, even if you have one page about red Mustangs buried in your site, people might find it. With an excellent search engine which has synonyms, facets, spell checking, related results, etc., you may be able to help the user find not only what they thought they were looking for, but contextual information about it. What if there is a mechanic looking for parts who types in

    1988 Corolla Fuel Pump

into a search? Shouldn’t the search engine know in which years the Corolla fuel pumps were the same, and which of them are available, and allow him to filter? Shouldn’t it know that in the late eighties the Chevy Nova was a clone of the Corolla, and had the same parts cheaper?

This is the type of high value information which comes from dealing with a real human, and no amount of brilliant forethought in information architecture can anticipate what the person is actually looking for. If I were doing IA for a parts website sans search, I’d have to have categories by model, by year, etc. Even in a straightforward example like that, value is lost. That’s why search engines need to ask the extra question, and the search engines most sites use today do not.
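Here is a toy sketch of what “asking the extra question” could look like for the fuel pump query. Everything in it is hypothetical: the catalogue, SKUs, prices and the `CLONES` table are invented, and a real engine would do this with an index rather than a list scan. The point is only that the query gets widened to vehicles known to share parts, and that the result comes back with a year facet to filter on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    sku: str
    name: str
    price: float
    fits: frozenset  # (model, year) pairs the part fits


# Invented catalogue: SKUs, prices and fitment are placeholders.
CATALOG = [
    Part("TOY-FP-1", "fuel pump", 95.0, frozenset({("corolla", 1987), ("corolla", 1988)})),
    Part("GM-FP-7", "fuel pump", 70.0, frozenset({("nova", 1987), ("nova", 1988)})),
    Part("TOY-FP-2", "fuel pump", 110.0, frozenset({("corolla", 1993)})),
]
KNOWN_MODELS = {"corolla", "nova"}
# Vehicles assumed to share parts, keyed by (model, year).
CLONES = {
    ("corolla", 1987): {("nova", 1987)},
    ("corolla", 1988): {("nova", 1988)},
}


def search(query):
    """Return (matching parts, cheapest first; sorted model years they cover)."""
    tokens = query.lower().split()
    year = next((int(t) for t in tokens if t.isdigit()), None)
    model = next((t for t in tokens if t in KNOWN_MODELS), None)
    name = " ".join(t for t in tokens if not t.isdigit() and t != model)
    # The "extra question": widen the vehicle to its known clones.
    vehicles = {(model, year)} | CLONES.get((model, year), set())
    hits = sorted(
        (p for p in CATALOG if p.name == name and p.fits & vehicles),
        key=lambda p: p.price,
    )
    years = sorted({y for p in hits for (_, y) in p.fits})
    return hits, years


hits, years = search("1988 Corolla Fuel Pump")
for part in hits:
    print(part.sku, part.price)
print("year facet:", years)
```

The cheaper Nova pump shows up first even though the mechanic never typed “Nova”, which is exactly the kind of answer a category tree cannot give.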

Faceted Search

Next time, whenever that is, I’ll write something about faceted search (a fancy name for search with fields and filters) and how I think the combination of Apache Solr and open source CMSs like Drupal, Typo3 and Joomla is going to pave the way to an entirely new concept of information architecture and of where we spend our usability testing money.

June 29th, 2008

the t() that took down a webserver

Hey folks,

I feel compelled to announce this. Please, please, please read this post if you have done any multilingual Drupal development.

Do not use t() outside of a hook or before locale gets to build its cache

I was asked to do some profiling on a colleague’s site, and I found this little doozy in the location_views module:

http://drupal.org/node/253813

The offending line is:

 define('LOCATION_VIEWS_UNKNOWN', t('unknown')); 

Okay, so what is the problem? When Drupal starts up, it runs through many “bootstrapping states”:

Booting Drupal’s locale module

The last one is:



 case DRUPAL_BOOTSTRAP_FULL:
      require_once './includes/common.inc';
      _drupal_bootstrap_full();
      break;
  }

Now let’s look at the end of _drupal_bootstrap_full:


// Load all enabled modules
  module_load_all();
  // Initialize the localization system.  Depends on i18n.module being loaded already.
  $locale = locale_initialize();

So the modules are getting loaded - module_load_all (they are included) before the locale module is initialized. In location_views, the offending statement is just interpreted upon inclusion because it isn’t wrapped in any hook. So it gets called before the locale_initialize() function.
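The difference between “runs when the file is included” and “runs when something asks for it” is easy to model outside of PHP. The following Python sketch is only an analogy: `t`, `load_module_eagerly` and `load_module_lazily` are stand-ins I made up, not Drupal functions. The eager loader mirrors a define() with t() at file scope; the lazy one mirrors wrapping the call in a function that is only invoked later.

```python
calls = []  # records every time the translation function actually runs


def t(text):
    """Stand-in for a translation function; just logs that it was called."""
    calls.append(text)
    return text


def load_module_eagerly():
    # Mirrors define('X', t('unknown')) at file scope: t() runs during loading.
    return {"UNKNOWN": t("unknown")}


def load_module_lazily():
    # Mirrors wrapping the call in a function: nothing runs during loading.
    return {"unknown": lambda: t("unknown")}


eager = load_module_eagerly()
ran_during_eager_load = list(calls)  # t() already ran

calls.clear()
lazy = load_module_lazily()
ran_during_lazy_load = list(calls)   # t() has not run yet

label = lazy["unknown"]()            # t() runs only now, on demand
```

In the lazy version the translation call happens at whatever point the label is first needed, which in Drupal terms would be after the bootstrap has finished.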

This last call, $locale = locale_initialize(), sets the global $locale variable to the ISO code of the language the user is viewing the site in. If it is not set, see what happens when someone calls t():


function t($string, $args = 0) {
  global $locale;
  if (function_exists('locale') && $locale != 'en') {
    $string = locale($string);
  }

.......

Okay, so look at the first conditional. When a module uses t() before locale is initialized, $locale is still null, so $locale != 'en' is true and (as long as the locale() function has already been loaded) the condition passes every time. So if we dig into locale(), what do we see:



function locale($string) {
  global $locale;
  static $locale_t;

  // Store database cached translations in a static var.
  if (!isset($locale_t)) {
    $cache = cache_get("locale:$locale", 'cache');

    if (!$cache) {
      locale_refresh_cache();
      $cache = cache_get("locale:$locale", 'cache');
    }
    $locale_t = unserialize($cache->data);
  }


So what happens?

We get here, because $locale_t doesn’t exist yet:

$cache = cache_get("locale:$locale", 'cache');

We are trying to get a cache for “locale:”. Obviously, this does not exist. Because of this, locale says, okay, let’s refresh the cache.
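To see why this never fixes itself, here is a small Python model of the flow above. It is a simplification under stated assumptions: the language list is invented, the cache is a plain dict, and `count_refreshes` is a hypothetical helper, not anything in Drupal. The refresh writes one entry per real language, but never an entry for the empty key “locale:”, so the early caller misses the cache on every single page load.

```python
LANGUAGES = ["en", "fr", "es"]  # assumed enabled languages


def count_refreshes(page_loads, locale_initialized):
    """Count how often the expensive cache rebuild runs over `page_loads` requests."""
    cache = {}      # stands in for the cache table
    refreshes = 0
    for _ in range(page_loads):
        # An unset $locale interpolates to an empty string in "locale:$locale".
        locale = "fr" if locale_initialized else ""
        key = "locale:" + locale
        if key not in cache:  # the cache_get() miss
            refreshes += 1    # the locale_refresh_cache() call
            for lang in LANGUAGES:
                # Only real language keys are written, never "locale:".
                cache["locale:" + lang] = {"unknown": "..."}
    return refreshes


early = count_refreshes(100, locale_initialized=False)
late = count_refreshes(100, locale_initialized=True)
print(early, late)
```

When the locale is known, the rebuild happens once and every later request hits the cache; when it is not, the rebuild happens on all 100 requests.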

The locale cache

The locale cache is a massive serialized array of strings and their matches from locales_source and locales_target. Let’s see how it is formed:


function locale_refresh_cache() {
  $languages = locale_supported_languages();

  foreach (array_keys($languages['name']) as $locale) {
    $result = db_query("SELECT s.source, t.translation, t.locale FROM {locales_source} s INNER JOIN {locales_target} t ON s.lid = t.lid WHERE t.locale = '%s' AND LENGTH(s.source) < 75", $locale);
    $t = array();
    while ($data = db_fetch_object($result)) {
      $t[$data->source] = (empty($data->translation) ? TRUE : $data->translation);
    }
    cache_set("locale:$locale", 'cache', serialize($t));
  }
}

OUCH!

This loops through each available language, and does a join selecting every string from locales source and locales target. It then builds a massive array (the bigger the translated site, the bigger the array) which eats up a huge amount of RAM - For reference, on http://www.amnesty.org, we’ve got over 10k translated strings, so we’re talking a few MB.

Then it serializes the whole thing - which uses a good amount of CPU, and then cache_set writes it to the cache table.

That’s where it gets really bad. When cache_set writes to the cache table, it first runs
LOCK TABLES cache;
then INSERT INTO cache…

So the cache table is effectively frozen for anyone who wants to get data from it or put data into it. Think about this: if you are inserting a 3MB locale cache array into cache, and that insert takes 200ms, every other user hitting the site who needs to query cache for variables is waiting in line.
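A bit of back-of-the-envelope arithmetic shows how quickly that line grows. This Python sketch assumes a single exclusive lock, requests arriving at a fixed interval, and every request holding the lock for the same time; the 200ms figure is from above, while the 100ms arrival gap is just an assumption for the example.

```python
def lock_waits(arrival_gap_ms, hold_ms, requests):
    """Time (ms) each request spends queued for one exclusive table lock."""
    waits = []
    free_at = 0  # moment the lock next becomes available
    for i in range(requests):
        arrived = i * arrival_gap_ms
        start = max(arrived, free_at)   # wait if someone still holds the lock
        waits.append(start - arrived)
        free_at = start + hold_ms
    return waits


# A request every 100ms, each holding the lock for 200ms.
waits = lock_waits(arrival_gap_ms=100, hold_ms=200, requests=5)
print(waits)

# If requests arrive slower than the lock is held, nobody waits.
relaxed = lock_waits(arrival_gap_ms=250, hold_ms=200, requests=5)
print(relaxed)
```

As soon as the lock is held longer than the gap between requests, every new request waits longer than the one before it, and the queue never drains.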

This creates a cascading effect where other processes which could have finished quickly are now holding up RAM in Apache, waiting for access to the DB. If your server isn’t fast enough, this basically runs MySQL’s process count up to its max, and people start being unable to connect, the DB server gets partial inserts, deadlocks, all kinds of ugly stuff.

What’s worse, if t() is used outside of a hook, it will fire on EVERY page load. So on the site I was doing the profiling for, it uses a lot of AJAX. So every AJAX request was actually running this massive insert as well!

In xdebug, running on my sandbox, I was able to bring the avg page load time down from about 3-4 seconds to 300-500ms with the simple patch referenced above.

Conclusion

  • Don’t ever run t() outside of a hook
  • Don’t ever run t() on non-static strings (if you have enough of them, this same thing will happen every time a new one appears in the system)
  • Watch out for cache_sets in your application. They can be a real silent killer. Everything will work fine, but you are killing yourself performance-wise. I suggest using xdebug, and if nothing else, go into cache_set, and add debug_print_backtrace(); to just see everyone who is using it.

June 28th, 2008

Media Mover Workflow needs

I’ve been playing around with Media Mover this week for Arthur with the goal of improving the general use cases for the community and giving some code review.

For those who don’t know, here is an excerpt from the project page of media mover:

Media Mover is a set of modules which allows website administrators to easily create complex file conversion processes. The core of Media Mover is the media_mover_api module which creates a set of rules allowing multiple modules to interact with a file. Media Mover can take a file emailed to an email account, turn a file attachment into an FLV file, create a new node with the file data, and then save the file on an external file storage like Amazon’s S3 all at once. And that’s just the start.

Wow! So does MM live up to the hype? Of course. Arthur wrote it. :)

What I’ve been looking into is the integration of Media Mover with workflow-ng and asset. Media Mover’s strengths right now are harvesting from multiple sources like ftp servers, email accounts, local stores, etc… and processing the video / audio / image via ffmpeg.

I’d say its main weakness is a Drupal weakness: how do files exist in the Drupal node paradigm? Where is their meta-data stored?

There are so many different ways people do this task, generally with modules like image or video (which is basically file as a node), or with modules like videofield, imagefield, etc., which is a file in the files table and a reference from a CCK field.

This kinda works, but asset provides a much more robust media integration framework, wherein an asset has metadata and formatting options, and can be embedded in the text body or attached as a CCK value.

The first goal of the integration is to use the store hook in media mover to store the incoming media as an asset. Secondly, we’re going to build a store function to create a new node, and place the asset in a CCK field for that node.

The ultimate scenario is to create multiple assets for multiple processing instructions, so we have a folder of low res videos and a folder of high res videos, and we are automagically somehow adding the created assets to CCK fields (think I have a store selling video, and I have a “preview” CCK asset field and a “full version” CCK asset field).

There are inherent data model problems here, both with asset and MM, but let’s see what we can do.

More next week when I’ve written the module, but if any Media Mover users are reading this, how is your experience with MM? What do you perceive as its lackings (if any) and what can be done to make it simpler / more intuitive for you?

-J

June 27th, 2008

Good Bye Appendix, Hello Google App Engine

While appendicitis is a painful, uncomfortable and ultimately debilitating condition - one which I have suffered with for over a year - its remedy (laparoscopic surgery, way too much money to an evil hospital, a shitton of painkillers and a week of bedrest) is fairly tame as far as surgical procedures go.

PHPitis is another condition I have been suffering from. It’s not nearly as painful and has provided me with a decent income and professional growth. It is also not a pus-filled, inflamed organ with no apparent value pressing up against all the nicely formed and functioning organs around it. However, the feeling that there is something not right “down there” remains.

I’ve always known that python is the best language out there. I don’t mean to start yet another stupid language fight, because every language is fine and dandy. But honestly, every programmer I respect has told me the same thing, they wish they could just write python. The only knocks against python are:

  • Lack of hosting options
  • Lack of Jobs / Skilled personnel

The first one is obviously stupid, because it is as easy to run as PHP, but it is totally chicken and egg. The second is pretty much the same. You have to be a better programmer to write python, so there are fewer people who can, so there are fewer CTOs wanting to risk a language with fewer people available.

Anyway, for what it’s worth, I love python, and I’m not that good at it yet, because I never get jobs in it, and it’s hard to get clients signed up for something that isn’t hosted anywhere…. Until now.

A day before I went to the hospital for an examination into yet another bout of my awful stomach pains which turned into the aforementioned surgery, I got my App Engine key in the mail.

I am afraid of the big G as much as anyone, and this offering is too new to judge. I won’t bore everyone with the questions everyone is asking about if your code can be moved or not, what is the guarantee they will keep it around, etc. I just want to say that the data api and the security of knowing that such terse syntax with so much power will scale till the end of the googleverse is exciting.

I highly recommend everyone give it a whirl, and discover / rediscover beautiful code. I was thinking of “killer apps” to build on it. My first inclination is that the “killer app” for the gapjinn (as I have now started calling it) is something with the following characteristics:

Uses Google Apps data apis

As much as possible, the application should store its data in google sites, google docs, google calendar, etc. This is because google will always have a commitment to making these two platforms work together AND you will not be made irrelevant because you can’t integrate.

Uses Google Apps provisioning APIs

Because SaaS is in, and you will want to make setup super easy.

Uses Google for authentication

Because you will not survive in the enterprise if you can’t take advantage of single sign-on services. As long as you are using all the google apps stuff, you should authenticate the same way.

Is niche enough that it does something better / easier / more focused than native google apps
OR

Provides an integration between different google apps and a 3rd party service

On the first count, it’s about creating a wrapper, basically a new interface layer to an existing google apps platform. Sometimes simple is too simple. For instance, using a series of tags and a small amount of data stored in the Google BigTable, I bet you could create a really quick and dirty CRM using google contacts + gmail + google sites + google calendar.

On the second, I think the potential to provide bridge functionality between more mature best of breed apps like basecamp, unfuddled, salesforce, will make quick enterprise dashboards possible. Something like a dashboard / middleware layer between this stuff.

Let’s see…

June 2nd, 2008

Awesome SVN training materials under CC

I’m doing a training for Srijan this week with their developers on good version control and change management practices, and while googling around I ran into this:

http://www.polarion.org/index.php?page=overview&project=subtrain

It is a fantastic set of training materials generously provided under CC 2.5 by Polarion.org.

It features ppts for admins as well as users, evaluation and feedback materials, as well as sets of exercises to simulate normal working environments. I can’t say enough about these materials; they are really top notch.

Major cred goes to the good folks providing them. Please check them out!

How To find me

Telephone: +1 510.277.0891 | Email: jacobsingh at gmail daht calm
