June 9th, 2009

Wake up and smell the coffee (through an HMAC filter)

Hey, stay out of my index!

So when I first joined Acquia, my fledgling Solr hosting service had IP-based security. You, the customer, could tell me which IPs you were going to connect from, and I would allow access to your search index from those IPs.

One of the first major tasks was to implement HMAC-based authentication for the service, both to guard against man-in-the-middle attacks and to let customers connect from any IP. It is also standard operating procedure for other Acquia services.
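For the curious, the idea behind HMAC authentication can be sketched in a few lines of Python. This is a minimal sketch, not the code running at Acquia: the SHA-256 hash, the fields included in the signed message, the five-minute timestamp window, and the function names `sign_request` and `verify_request` are all assumptions made for illustration.

```python
import hashlib
import hmac
import time


def sign_request(secret: bytes, method: str, path: str, body: bytes, timestamp: int) -> str:
    """Return a hex HMAC over the parts of the request we want to protect."""
    # Assumed message layout: method, path, timestamp and body joined by newlines.
    message = b"\n".join([method.encode(), path.encode(), str(timestamp).encode(), body])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, body, timestamp, signature, now=None, max_skew=300):
    """Server side: recompute the HMAC and compare it with the one the client sent."""
    now = int(time.time()) if now is None else now
    # Reject stale timestamps so a captured request cannot be replayed later.
    if abs(now - timestamp) > max_skew:
        return False
    expected = sign_request(secret, method, path, body, timestamp)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)
```

The client and the server share the secret, so the client can connect from any IP, and a request altered in transit no longer matches its signature.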

Fail first!

In the first iteration, we built something on the load balancers (which run nginx) because it provided a central point of access control, the balancers were under-utilized and we didn’t have to mess with the Solr code.

This worked okay for a while and was decently fast, but it was quite flaky, as some stupid developer had the brilliant idea to implement it as Python middleware with fcgi (flup). That developer was me.

Don’t fail second!

So to combat the unstable nature of the fcgi protocol, and to make things a little more efficient, I (along with help from Peter Wolanin and Douglas Hubler) rebuilt it in Java using a Servlet Filter. This was a royal pain in the butt, as Java is pretty tricky when it comes to input streams and buffers.

Thankfully the results are worth it:

It’s hard to tell from this graph because of the peak, but the median (blue line) stayed almost the same, while the average (purple) decreased pretty significantly, as did the 90% line (yellow).

source=solr_nginx_access (eventtype=solr_search_request)| timechart span=2h median(request_time), perc90(request_time), avg(request_time) as avg_request_time - in the past 3 days - ip-10-251-75-227 - Splunk 3.4.8

This graph shows the standard deviation (blue) in addition to the previous numbers and describes more acutely what the previous graph suggests, that is, the previous implementation was not any slower really, but less consistent, causing some of the requests to take much longer than others.

source=solr_nginx_access (eventtype=solr_search_request)| timechart span=2h stdev(request_time), median(request_time), perc90(request_time), avg(request_time) as avg_request_time - in the past 3 days - ip-10-251-75-227 - Splunk 3.4.8
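To see why two implementations can share a median and still feel very different, here is a small Python sketch of the same four statistics the Splunk queries chart. The request times in `flaky` and `steady` are invented for illustration and are not taken from our logs; the nearest-rank percentile is also an assumption and may differ from how Splunk computes `perc90`.

```python
import statistics


def summarize(request_times):
    """Compute median, average, 90th percentile and standard deviation."""
    ordered = sorted(request_times)
    # Nearest-rank 90th percentile: the value at position ceil(0.9 * n).
    rank = (9 * len(ordered) + 9) // 10
    return {
        "median": statistics.median(ordered),
        "avg": statistics.mean(ordered),
        "perc90": ordered[rank - 1],
        "stdev": statistics.stdev(ordered),
    }


# Hypothetical request times in seconds: most requests are equally fast,
# but the flaky set has a couple of very slow outliers.
flaky = [0.05] * 7 + [0.06, 1.5, 2.0]
steady = [0.05] * 7 + [0.06, 0.07, 0.08]

for name, times in (("flaky", flaky), ("steady", steady)):
    print(name, summarize(times))
```

Both sets have the same median, but the average, the 90th percentile and the standard deviation are all much higher for the flaky one, which is the pattern the graphs show.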

So there you have it: Acquia Search is both secure and fast, and now 200% more reliably fast :)

March 17th, 2009

Acquia Search is rocking!

Just want to make a quick update and say that at long last my search project is on the internets and getting decent uptake.

Peter Wolanin and I presented at DrupalCon (don’t laugh too hard at me).

I think the reception so far has been great, and the servers have been champs :) We’re getting more and more signups every day.

One really cool one is Bryan from the CMS Report:
http://cmsreport.com/search/apachesolr_search/Drupal%20Search

If any of you out there want to find out how search can change how you build your sites and bring more people together with the pages they need, just click this link now:

http://acquia.com/products-services/acquia-search

A lot of people are excited but worried about trying out the free beta because we haven’t released any pricing, and something this cool will certainly cost more than a few pesos.

Well, fear not Drupalers, we’ve heard your call and are working on releasing some preliminary pricing soon. We wanted to wait until the beta got rolling a bit before we did this, but we want people to know we have a commitment to making this technology available to as wide an audience as possible. Stay tuned to my blog and/or the planet/acquia.com for more updates on this.

In the meantime, sign up for the beta. It really only takes 15 minutes to set up and won’t break your site or lock you in at all. Please give us feedback! We know there are a few usability hiccups still in the signup process, and we’d love your input so we can fix ‘em. Thanks!

February 10th, 2009

The private lives of public IPs and EC2 security groups

As many of you know, I’m working on a Hosted Search product at Acquia. We’re building a pretty cool page where you can get some analytics on your search index and what people are searching for. Here is the deets on Dries’s site (hope he doesn’t mind ;0 )

Path Finder

For this, we’re using Splunk which is a tad more pricey than I’d like, but a really amazing tool. Basically, it is grep + awk + a kilo of coke + a dozen redbulls + a Ferrari Testerosa + the same HGH A-Rod has been chewing. I’ll write more about it at some point, but this screen shot should give you an idea:

Path Finder

Anyway, we use Splunk’s API to grab data into acquia.com and show the page above. The page was taking 10 seconds to load… I was stumped. Splunk seemed so fast, a couple seconds is reasonable for loading a report from millions of records, but 10 seconds was pretty extreme.

Eventually we discovered it was not Splunk at all, but a separate call in our code to a web service (call it Info Server) in EC2 which was being firewalled by Amazon. This caused the request to sit there for 10 seconds and then time out.

Here’s how security groups work:

I’ve got 2 servers:
Web Server – Serves static files (:80 and :443) and passes tough stuff to app server
App Server – serves requests back up to web server on :8080

Web Server needs to be able to access App server to push proxied requests through.

In EC2, each server has one (or more) security groups. A security group is a list of access rights. These can be by port and IP range, or they can be references to other groups. (wtf?)

Yeah, so the rule for the web server would probably be something like:
IP:any Port:80
IP:any Port:443
IP:111.111.111.111/24 Port:10000 (maybe some admin port for a certain location to access)

For the App Server, we don’t want 8080 world readable. We also don’t know the IP of the web server because this is elastic, baby; servers can’t stand still. That’s why we give group permissions. So it looks like:

Group: Web Server

Which means any server launched on your account with the security group “Web Server” will have total access to any server launched with the security group “App Server”. Got it?
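If you think better in code, the rules above can be modelled roughly like this in Python. This is a toy model for illustration only, not Amazon's API: the `Rule` class, the `GROUPS` table and the `allowed` function are names I made up, and treating a missing port as "any port" is my reading of "total access".

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    port: Optional[int] = None          # None means "any port"
    cidr: Optional[str] = None          # allow by source IP range
    source_group: Optional[str] = None  # allow by the caller's security group


# The two groups from the example above.
GROUPS = {
    "Web Server": [
        Rule(port=80, cidr="0.0.0.0/0"),
        Rule(port=443, cidr="0.0.0.0/0"),
        Rule(port=10000, cidr="111.111.111.111/24"),
    ],
    "App Server": [Rule(source_group="Web Server")],
}


def allowed(dest_group, port, source_ip, source_groups=()):
    """Return True if any rule of dest_group admits this connection."""
    ip = ipaddress.ip_address(source_ip)
    for rule in GROUPS[dest_group]:
        if rule.port is not None and rule.port != port:
            continue
        if rule.cidr is not None:
            # strict=False tolerates host bits in the range, as in the example.
            if ip in ipaddress.ip_network(rule.cidr, strict=False):
                return True
        elif rule.source_group in source_groups:
            return True
    return False
```

With this model, `allowed("App Server", 8080, "10.0.0.5", source_groups=("Web Server",))` is true, while the same call from a random outside address with no group membership is false.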

If not, here is an FBI style blackout picture which might make it more clear:

Path Finder

In our case, we had a problem because we were referencing the external IP of our server (Info Server). See, in the depths of the Amazon, each machine has a public IP and a private IP. So when you look for infoserver.acquia.com (made up, btw) it will resolve to 74.x.x.x. When you try to look for ec2-10-45-123-41.compute.aws…. it will resolve to 10.24.134.41, and both point to the same place. The difference, of course, is that the security group settings only apply when you are using the internal IP, even if both servers are inside the cloud.
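A cheap guard against this mistake is to check which address your configured endpoints actually resolve to before you rely on group rules. Here is a minimal Python sketch, assuming internal addresses live in 10.0.0.0/8 as in the example above; the helper names and the endpoint table are hypothetical, and 198.51.100.7 is a documentation address standing in for a public IP.

```python
import ipaddress

# Assumption for this sketch: the cloud's internal addresses are in 10.0.0.0/8.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


def uses_internal_ip(resolved_ip):
    """True if the address is on the internal network, where group rules apply."""
    return ipaddress.ip_address(resolved_ip) in INTERNAL_NET


def endpoints_bypassing_groups(endpoints):
    """Return the names of services addressed through a non-internal IP."""
    return [name for name, ip in endpoints.items() if not uses_internal_ip(ip)]


# Hypothetical configuration: service name mapped to the IP its hostname resolved to.
endpoints = {
    "info-server-internal": "10.24.134.41",
    "info-server-public": "198.51.100.7",
}
print(endpoints_bypassing_groups(endpoints))
```

Anything the check reports is a call that will be judged by its public IP rather than by its security group, which is how we ended up waiting for a timeout.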

Hope you’ve been saved some pain.

Please come check out Peter Wolanin and me as we present the future of Drupal search (we hope) at DrupalCon!

January 14th, 2009

Making Module Installation Easy for Acquia Search

Jeff Noyes (our Simplicity Guru), Linea Rowe, Peter Wolanin and I sat down to discuss how the install process for our Hosted Search Service would look (yes, we’re getting close – Private Beta is out in two weeks)! Typically, when you have a faceted search engine, there is a set of filters on the left and search results on the right, with the sorting links generally horizontally aligned somewhere near the search box.

Here are a few examples from around the web:

Newegg.com - 15
stuff : Clearance : Target Search Results
pancakes, Books, DVDs Movies items on eBay.com

And here is our current implementation:

Search | Dries Buytaert

Here is the same shot, but broken down in “drupalish”

Search | Dries Buytaert

I think it works okay, but we’re concerned that when people enable the module, they will have a hard time getting this together. Here is a series of screen shots of a user enabling and setting up the module:

Modules | ad

This part is simple (if you use Acquia’s hosted search), you just enable one module and you are done configuring the connection to Solr.

However, your standard search ends up looking like this:

Search | ad

To get all the nice sorting and facet filters, you need to know (somehow) to go to admin/build/blocks and drag the ApacheSolr: blocks into regions like this:

Blocks | ad

So what do people think? Should we just enable a few blocks “out of the box” and hope you are using garland and have a region named “left” or “left-sidebar”? If so, which blocks? Alternately, how can we provide a good workflow so people know they need to take that extra step to set up their search? The other option Jeff suggested (which is most usable) is to have one block, where you can select what filters you want in it. The downside is that the user loses flexibility about where to put filters (maybe they want sorting on the right, etc).

I’d like to get some feedback on:

A). How to make this process so simple that it really is just checking that one box on the modules page and letting cron run and it looks great for 90% of users.

B). What the default blocks to enable are, and where they should be on the screen

C). How do we address this problem of multi-step installs which want to setup blocks in a more usable way for newbies?

See ya!
jacob

November 28th, 2008

What could search look like on d.o. and g.d.o

Robert Douglass, Peter Wolanin and I are scheming up what we hope will be a jaw-dropping presentation of ApacheSolr + Drupal integration at DrupalCon DC. We’re going to show a prototype of d.o. and g.d.o. hooked up to the Apache Solr search server. We all know that d.o. and g.d.o. are notoriously hard to search through.

For instance, take this query:

http://drupal.org/search/node/views (searching for views).

Search | drupal.org


August 12th, 2008

250k nodes working to save our habitat

I had the privilege of working with Srijan Technologies this spring on Drupal and Agile Development trainings for their team and helping them get Apache Solr kicking for the India Environmental Portal which just launched last week to much fanfare.

The site is based on Drupal 5 and features:

What is an environmental portal?

India Environmental Portal is an initiative of the Center for Society and the Environment, one of India’s oldest and most revered environmental NGOs. Here is an excerpt from their about page:

This is the age of environment. And to make a difference, in our lifestyle, in policy and in practice we need information, which is accessible, well categorized and easy to use. The India Environment Portal is our effort to put together a one-stop shop of all that you want to know about environment and development issues. Its politics is overt: to build open, networked and informed societies, who can use knowledge to make change…..

This is a people’s portal. It will actively collate and exchange data, research and information from people working in the field, in campaigns, in scientific institutions, in research and in industry.

I recommend checking out the about page to find out more about this exciting resource:
http://www.indiaenvironmentportal.org.in/content/about-us

Congratulations to Ipsita, Rahul, Syed, Shashank, and all the rest of the excellent team at Srijan!

And a special thanks to drunken monkey and Robert Douglass for their work to integrate Drupal and Apache Solr.

Some press about the launch

http://www.financialexpress.com/news/National-portal-on-environment/347761/

http://in.news.yahoo.com/43/20080811/812/tnl-national-online-portal-on-environmen.html

http://www.thehindu.com/holnus/002200808112067.htm

http://www.ecoearth.info/shared/reader/welcome.aspx?linkid=104697&keybold=climate%20forest%20environment%20warming

http://www.indiaprwire.com/businessnews/20080811/32685.htm

http://alootechie.net/content/indiaenvironmentportalorgin-launched-provide-environmental-information

July 1st, 2008

Solving bad IA using enterprise search (Reverse Advanced Search)

Since I started working with Apache Solr in Drupal, I’ve realized how much client money has been wasted making ill-advised advanced searches. We’ve all gotten the requests for “advanced” searches, and they make any IA-god-fearing developer cringe. For the 1% of users who use them, you blow tons of budget, and the result is often quite poor because the client doesn’t really know their data or their users that well.

For those of you who are unfamiliar with faceted search, compare the following:

I searched two sites for WSXGA because I’m looking for a laptop with decent resolution.

Laptops Direct

vs.

New Egg


The New Egg search lets me filter, so I know that if I’m looking for a laptop between $750 and $1000, I’ll get 5 results. After that filter, I’ll know what’s available, the number per manufacturer, etc.

Contrast that with an advanced search form where I have to put in all my criteria, and hope I get a result. I might also miss certain results if my vocabulary is bad, or I don’t understand that the website says “high resolution” instead of WSXGA, so I don’t select it.
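The mechanics of the faceted approach are simple enough to sketch in Python. This is a toy in-memory version to show the idea, not how Solr or New Egg do it: the `CATALOG` entries, the price buckets and the function names are all invented for illustration.

```python
from collections import Counter

# Hypothetical catalog; brands, prices and descriptions are made up.
CATALOG = [
    {"name": "Alpha", "brand": "Acme", "price": 699, "description": "15 inch WSXGA laptop"},
    {"name": "Beta", "brand": "Acme", "price": 899, "description": "WSXGA widescreen laptop"},
    {"name": "Gamma", "brand": "Globex", "price": 949, "description": "wsxga laptop"},
    {"name": "Delta", "brand": "Globex", "price": 1299, "description": "high resolution laptop"},
]


def price_bucket(price):
    """Map a price to a facet label (bucket boundaries are assumptions)."""
    if price < 750:
        return "under $750"
    if price <= 1000:
        return "$750 - $1000"
    return "over $1000"


def facet_value(item, field):
    """Return the value an item contributes to a given facet."""
    return price_bucket(item["price"]) if field == "price" else item[field]


def search(items, keyword, filters=None):
    """Keyword match first, then keep only items matching every chosen facet."""
    filters = filters or {}
    hits = [i for i in items if keyword.lower() in i["description"].lower()]
    return [i for i in hits if all(facet_value(i, f) == v for f, v in filters.items())]


def facet_counts(items, field):
    """Count how many of the current results fall under each facet value."""
    return Counter(facet_value(i, field) for i in items)


hits = search(CATALOG, "WSXGA")
print(facet_counts(hits, "price"))
narrowed = search(CATALOG, "WSXGA", {"price": "$750 - $1000"})
print(facet_counts(narrowed, "brand"))
```

The counts are computed from the current result set, so the user sees how many results each filter would leave before clicking it, which an advanced search form cannot tell them.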

I think it’s obvious to anyone why faceted search is a good thing. In my next post, I’ll be exploring why it hasn’t gotten widespread adoption, particularly in the small business / NGO sector, and how I plan to help change that.

June 30th, 2008

Solving bad IA using enterprise search (Vocabulary)

Long time no blog…

I had a bit of a realization (or rather a resurgence of a recurring realization) that I enjoy writing. It happened this weekend as I was “getting away from it all”.

I’ve been interested in enterprise search for small and medium enterprises for a while now. Having implemented the Google Mini and the GSA, I’ve seen how a good search can really turn your information architecture on its head in a good way. Like any conversation between two entities be they two people or a person and a website, communication is difficult, and many of the same rules apply:

Vocabulary

You have to speak the same language. This doesn’t mean a Thai can’t talk to a Nigerian, but it does mean that when you are communicating, if the same word doesn’t mean the same thing (which it never does), your intentions, delivery and content are worthless. That is why non-verbal communication and communication over phones are so ineffective compared to face to face meetings. The words may be the same, but the interpretation never is.

So what does this have to do with enterprise search? When a user wants something from your website, they are looking for a keyword. Many computer scientists have tried to make linguistically aware search engines which correctly interpret sentences and questions. Some of these results are useful, but generally, I believe people don’t come to a site thinking “Do you have any red Toyota Corolla?” Internally, they are simply thinking “Corolla” and “Red”. For instance, I could speak only two words of English, Red and Corolla, and chances are, I could walk into any American city and rent a red Toyota Corolla.

When one plans information architecture for a site, one usually starts with personas, or stereotypes of users, which have goals. Then you define the actions they take to meet those goals, and try to use the same vocabulary and thought process as these users to make an interface which is organized like their brain. But when you have 10 different personas, how is this possible? Within your 10 stereotypes there can be huge variation, and outside of your assumptions there may be other users you never thought about. With a good search engine, even if you have one page about red Mustangs buried in your site, people might find it. With an excellent search engine which has synonyms, facets, spell checking, related results, etc., you may be able to help the user find not only what they thought they were looking for, but contextual information about it. What if a mechanic looking for parts types “1988 Corolla Fuel Pump” into a search? Shouldn’t the search engine know which years of Corolla share the same fuel pump and which are available, and allow him to filter? Shouldn’t it know that in the late eighties the Chevy Nova was a clone of the Corolla, and had the same parts cheaper?
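The synonym part of that wish list can be sketched in a few lines of Python. This is a toy stand-in for what a real search engine's synonym support does, not Solr's implementation: the `SYNONYMS` table encodes only the Corolla/Nova example from this post, and the `PARTS` list and function names are invented for illustration.

```python
# Synonym groups: a query word matches any word in its group.
SYNONYMS = {
    "corolla": {"corolla", "nova"},
    "nova": {"corolla", "nova"},
}

# Hypothetical parts listings.
PARTS = [
    "1988 Toyota Corolla fuel pump",
    "1988 Chevy Nova fuel pump",
    "1988 Toyota Corolla brake pads",
]


def expand(query):
    """Turn each query word into the set of words that should count as a match."""
    return [SYNONYMS.get(word, {word}) for word in query.lower().split()]


def find_parts(parts, query):
    """Return listings that contain a match for every expanded query word."""
    groups = expand(query)
    return [p for p in parts if all(set(p.lower().split()) & g for g in groups)]


print(find_parts(PARTS, "1988 Corolla Fuel Pump"))
```

The mechanic's query now finds the Nova fuel pump as well as the Corolla one, without him having to know the two cars share parts.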

This is the type of high-value information which comes from dealing with a real human, and no amount of brilliant forethought in information architecture can predict what the person is actually looking for. If I were doing IA for a parts website sans search, I’d have to have categories by model, by year, etc. Even in a straightforward example like that, value is lost. That’s why search engines need to ask the extra question, and the search engines most sites use today do not.

Faceted Search

Next whenever, I’ll write something about faceted search (fancy name for search with fields and filters) and how I think the combination of Apache Solr and open source CMSs like Drupal, Typo3 and Joomla is going to pave the way to an entirely new concept of information architecture and where we spend our usability testing money.

How To find me

Telephone: +1 510.277.0891 | Email: jacobsingh at gmail daht calm
