Debugging apache mod_proxy_balancer

14 10 2013

Below are some notes that I made while debugging mod_proxy_balancer. I had to set it up in a hurry when I realized that the Amazon Elastic Load Balancer I was using could only do sticky sessions using cookies. I needed a load balancer that could use a URL parameter to maintain sticky sessions. Thankfully a friend suggested that we use mod_proxy_balancer.

There is a lot of material about mod_proxy_balancer, but it is hardly simple to get it working. There are some less known details without which you cannot get it working. I would suggest you take a look at Nginx or other alternatives before choosing mod_proxy_balancer.

Tech stack summary

I had to serve a NodeJS based Javascript API using a load balancer capable of sticky sessions. The reason for sticky sessions is beyond the scope of this blog :). The entire setup is on Amazon EC2 instances running CentOS.

The whole #!

Since I had only two servers to load balance, I assigned them the ids s.1 and s.2. It is very important that the routes are named with an alphanumeric prefix, a dot and then a number. Eg: server.1, t.2 etc. The mod_proxy_balancer code splits this route name at the dot and uses the second part as the route number. So s.1 would point to “route=1”.

<Proxy balancer://mycluster/>
 BalancerMember http://<ip-address-1>:80 route=1
 BalancerMember http://<ip-address-2>:80 route=2
</Proxy>

The first request coming to mod_proxy_balancer is routed to one of the BalancerMembers based on the load balancing method. Let's say this request is received by the server with route id s.1. The server then serves the request along with its route id (routeId=s.1). All further requests from that browser should now contain the url parameter “routeId=s.1”. The configuration below tells mod_proxy_balancer to read this url parameter (its name must match whatever you pass to stickysession) and use it to route the request to server 1.

ProxyPass / balancer://mycluster/ lbmethod=byrequests stickysession=_nsrouteid
ProxyPassReverse / balancer://mycluster/

That should get things working. But how do we know the above setup is working and sending requests to the appropriate servers?

LogLevel warn
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{BALANCER_SESSION_STICKY}e\" \"%{BALANCER_SESSION_ROUTE}e\" \"%{BALANCER_WORKER_ROUTE}e\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

BALANCER_SESSION_STICKY – This is assigned the stickysession value used for the current request. It is the name of the cookie or request parameter used for sticky sessions

BALANCER_SESSION_ROUTE – This is assigned the route parsed from the current request.

BALANCER_WORKER_ROUTE – This is assigned the route of the worker that will be used for the current request.

I have taken the above information from the mod_proxy_balancer documentation.

To begin with, BALANCER_SESSION_STICKY should be the same as the “stickysession” parameter in the ProxyPass configuration. BALANCER_SESSION_ROUTE will not be set for the first request from the browser. BALANCER_WORKER_ROUTE will be chosen based on the load balancing algorithm.

After the first request is served by one of the servers, all further requests sent to the server should carry the routeId url parameter. BALANCER_SESSION_ROUTE should show the value parsed from the url parameter – it should be “1” when the url parameter is “routeId=s.1”. BALANCER_WORKER_ROUTE will be the same as BALANCER_SESSION_ROUTE. This shows that the requests are sticky.





Subdomains, pretty urls and some config

15 01 2011

This post collates information about using subdomains to make your urls look much nicer. Say you are building a tumblr like service; then you would also think of providing subdomain based urls for each of your customers. For the sake of explanation, let's say you have a website called sconesandtea.com and you want to have several urls under this domain like cream.sconesandtea.com, jam.sconesandtea.com etc. It's not rocket science, but it is rather painful to search for all the information yourself if you are new to this. This is more of a write up for myself, so excuse the free form writing style.

To start with, you should configure subdomains with your domain registrar. Some basics are here. In short, you have to make sure the intended subdomain based urls reach your server in addition to your domain based urls.

Once you are done with that, take stock of the problem you have at hand.

  1. If you just want your url to redirect, that can be handled at the nginx or apache level. For example, if you just want cream.sconesandtea.com to redirect to sconesandtea.com/addons/cream, you can achieve this with a url rewrite in nginx or apache. The point to bear in mind is that this setup results in an http redirect: the url in your browser will not be cream.sconesandtea.com, but sconesandtea.com/addons/cream after the redirect. We will go into this in detail in a bit.
  2. But if you do not want http://sconesandtea.com/addons/cream to be exposed to the outside world, and the public url should be http://cream.sconesandtea.com, then some logic needs to be built into the app.

Simple Redirection

Below is a snippet using the nginx url rewrite module.

set $subdomain "";
set $subdomain_root "";
if ($host ~* "^(.+)\.sconesandtea\.com$") {
set $subdomain $1;
rewrite ^(.*)$ http://sconesandtea.in/addons/$subdomain;
break;
}

This will return an http 302 redirect. If you want the status code to be 301, append the permanent keyword to the rewrite line.

rewrite ^(.*)$ http://sconesandtea.com/addons/$subdomain permanent;

More on this here.

Handling subdomains at application level

The first thing to get past is simulating the production scenario on a dev box. As most of you would know, add the below entry to your /etc/hosts to simulate a domain based url locally.

127.0.1.1    sconesandtea.com

But /etc/hosts does not support wildcard subdomains. So for testing purposes add each subdomain explicitly.

127.0.1.1    cream.sconesandtea.com

Depending on the framework you are using, there may be several ways of achieving the logic to use subdomains to render specific pages. For django you could use the middleware available here. This is quite a useful snippet: it makes the subdomain available through the request object, which you can use elsewhere in your code. This snippet does not support subdomain based urls starting with www, so you may have to tweak it as per your application's needs.
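For reference, the core of such middleware is quite small. Below is a minimal sketch of the idea (this is my own illustration, not the linked snippet, and the class and attribute names are made up), written against old-style Django 1.x middleware:

class SubdomainMiddleware(object):
    def process_request(self, request):
        # host looks like "cream.sconesandtea.com" (strip any port)
        host = request.get_host().split(':')[0]
        parts = host.split('.')
        request.subdomain = None
        # anything before the registered domain is treated as the subdomain;
        # "www" is ignored so www.sconesandtea.com behaves like the bare domain
        if len(parts) > 2 and parts[0] != 'www':
            request.subdomain = parts[0]

Your views can then branch on request.subdomain to render customer specific pages.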

Please feel free to add or correct any information here.





Sqlserver Non-clustered indexes and deadlocks

5 07 2010

ORM tools and other abstractions over an RDBMS have become ubiquitous. But there is no substitute for understanding the basics of a database. This opinion of mine was only reinforced by a recent issue which I was fixing with a colleague.

Bug: The error log showed the below exception
System.Data.SqlClient.SqlException: Transaction (Process ID 53) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

Tech stack: .Net 3.5, sqlserver 2005, nhibernate

The exception stack trace pointed to the table that was being deadlocked.

Could not execute command: UPDATE Email SET PersonId = @p0 WHERE Id = @p1

Recreating deadlock issues is not a trivial thing. But thankfully in our case the deadlock was so severe that when I ran my tests in parallel, almost 50% of the transactions failed at a concurrent load of just 2. That was a decent first step since we were consistently able to reproduce the issue.

Sqlserver Management Studio comes with some tools which are quite useful in this situation. To see what was causing the deadlock, all I had to do was run the profiler on the database. To launch a profiler trace, follow the below steps.

tools > profiler > file > new trace > mention database details

The trace properties window should open up. Open the events selection tab and select “show all events”. This should show more events. Under the Locks section, select all the events that may be useful to you (the deadlock graph event in particular).

Start the trace, run your tests in parallel, sit back with some popcorn and enjoy the action packed adventure. Run a find for “Deadlock” and you should be presented with a nice picture of what is happening.

Let's zoom in on the action.

Inferences:

  • The deadlock is not at the object (table) level, because the object ids are the same. This is something which we also guessed from the query in the exception log: UPDATE Email SET PersonId = @p0 WHERE Id = @p1
  • But the page ids are different.

Quite puzzled, we looked at the table design to see if something was wrong there. And yes, we saw what the problem was: the table did not have a primary key column.

Even though that may look like a harmless issue, there are consequences to creating a table without a primary key in sqlserver. When you define a primary key, a unique clustered index is created. But this table only had a unique constraint on the id column, which creates a unique non-clustered index. Non-clustered secondary indexes may introduce deadlocks. More details in this link (see Non-Clustered indexes). You will also find it very useful to know how clustered and non-clustered indexes work.

In this case the primary key, and thereby the clustered index, was missing. We introduced a primary key constraint on the id column and ran the tests again. Even at a much higher concurrent user count the deadlocks did not happen again.
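For reference, the fix was essentially the statement below (the constraint name is illustrative; if the column already has a separate unique constraint you may need to drop that first):

-- adds a primary key, which creates a unique clustered index on Id
ALTER TABLE Email
    ADD CONSTRAINT PK_Email PRIMARY KEY CLUSTERED (Id);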





Customize gradle directory structure

4 06 2010

I started using gradle very recently. It is so much easier to understand than maven. I guess I am not intelligent enough for maven. Gradle follows a directory structure very similar to maven's, but I wanted to change it to match my project's directory structure.

project
project/src
project/test

All I had to do was customize the sourceSets.

sourceSets {
    main {
        java {
            srcDir 'src'
        }
    }
    test {
        java {
            srcDir 'test'
        }
    }
}

I could not find this information straight away (especially the test sources location). So I am posting it here for future reference.





CI – Have we forgotten the integration bit?

22 05 2010

Almost all software that has ever been built had to go through an integration stage in one way or another. It is quite odd that many software projects, even to this day, look at integration as a separate phase of the project. But things are changing quite fast and many projects are adopting Continuous Integration. This is good news, because we save ourselves a lot of time otherwise wasted on so-called integration bugs.

So what is the issue?

Let's cut to the chase. Most organizations tend to try out CI on a not so important team before it is adopted across all teams. But in such a scenario CI is only building a smaller system. Even organizations that have been using CI for quite some time seem to have CI builds per team. But usually the software that the smaller teams are building is just a small piece in a bigger system. Integration bugs still have a longer feedback cycle.

It is not enough to have a CI infrastructure that only verifies the subsystem. CI must deploy the smaller module onto an environment with the rest of the system and run tests as a whole. This gives feedback on the integration.

So we get the point. What's the big deal?

Actually, setting up good CI is not so simple. Unless the build system is well thought out, CI does not come for free. While writing the build scripts/code one must keep the bigger picture in mind. The effort involved is very similar to a traditional production release. It is also important to revise the build from time to time. While a build that fails for several unknown reasons is a pain, a build that is not proving anything is even worse.

CI should also not be an activity done solely by the configuration management team. The people who are responsible for the application design also have a big role to play. One size does not fit all, and concepts that proved successful in one project may not work so well for another. Also, CI has to evolve with the system that it is integrating. Unless it is up to date with the latest design changes, it's not doing much.

If your CI is only testing a subsystem, it can only be called an automated build system for that module. CI needs to verify key integration issues and, to some extent, even performance. Writing build systems that truly integrate is an interesting activity. It gets you thinking about how two systems integrate. Questions that you would not have thought about previously suddenly become more obvious.

A Continuous Integration build must be run on a production-like environment. If not, the application must be deployed to production as often as possible. A well written CI also inspires the confidence to move to a continuous deployment mode.

I do not claim to be an expert on CI, but I think it is important not to forget the principles behind the practices. I would place more importance on the integration bit of CI than on smaller checks that verify code style etc.

Also I would recommend reading Sai’s blog on CI.

Let me know your thoughts on this.





Agile Bengaluru 2010

31 01 2010

It's been a week since Agile Bengaluru 2010. It was a really nice experience to meet J. B. Rainsberger, Jeff Patton, David Hussman, Naresh Jain and many more people all in one place. I liked the venue even though there was no wifi :P. More than anything else I really appreciate the “go green” theme. This is not a complete experience report; apologies for my laziness when it comes to writing long posts.

Being a “Post Agile Conference”, most topics were aimed at reflecting on how agile has helped us in the past and where we are going. Below are some of the sessions that I enjoyed.

Discovery and Delivery – Redesigning agility – Keynote by David Hussman. A great start to the conference.

Monkey See Monkey Do – by Naresh Jain and Sandeep Shetty – A very interesting talk that looked at some of the agile practices that have become dogma. It would have been great if we had some more time.

Outside the code – Using agile practices to drive product success – by Jeff Patton – It was nice to hear about the “Discovery” part of agile software development. Go check out the slides.

Using Theory of Constraint and Just in time approach to coach agile teams – a workshop by J. B. Rainsberger and Naresh Jain.

Stop it or I will Bury you alive in a box – by J. B. Rainsberger – J.B. spoke about the 10 things we should stop doing in 2010.

Captain planet (Saurabh Arora) talks – Very inspiring talk on global warming. Keep up the good work dude.

Apart from this I got an opportunity to talk to Jeff Patton for some time between the sessions. We spoke about how words like requirements, customers etc do not go well with software development.

I had to rush before all the lightning talks got over. Managed to listen to some talks like “Agile Deployment”. Just before I left I spoke about “Developer + Tester + Operations = DevTestOps”.

Programming with the stars – A very entertaining session. The participants had to impress the audience to get selected for the next round. The winners got to pair with four accomplished developers (stars) and come up with a five minute coding exercise that they had to present to a panel of judges (J. B. Rainsberger, Jeff Patton). I really appreciate the participants and the stars for live coding in front of a sizable audience. The winners got a lifetime e-learning license from Industrial Logic.

Last but not least, I enjoyed speaking about Breaking the Monotony with Sai Venkat. I liked the way the talk went. The agility we seek comes from the code we write and the systems we build, and not just from the processes and practices we follow. This was the theme of our talk. Got great constructive feedback from the audience.

The conference ended with an open Q and A session.

Kudos to the organizers. The slides are available and the videos should be available soon.

Looking forward to seeing if we can get the dogma out of agile and build great software. Feel free to add comments about any topics that I have missed out.





Jumping through hoops to represent trees in Database

29 12 2009

Recently I have been working on a project where we have to represent hierarchical data in a database. Unfortunately we do not have much choice with the database: we are using a relational database.

If you have done this, you will agree with me that it is not a very enjoyable experience.

Firstly, we need to choose between several models to represent trees in a database:

a. Adjacency (self referential tables)

b. Materialized path (lineage)
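To make the two options concrete, here is a rough sketch of what the tables could look like (table and column names are illustrative):

-- a. Adjacency: each row points at its parent
CREATE TABLE Person (
    Id       INT PRIMARY KEY,
    ParentId INT NULL REFERENCES Person(Id),
    Name     VARCHAR(100)
);

-- b. Materialized path: each row stores its full lineage as a string
CREATE TABLE PersonWithPath (
    Id   INT PRIMARY KEY,
    Path VARCHAR(900),   -- e.g. '/1/4/9/'
    Name VARCHAR(100)
);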

Shortcomings of adjacency model

Tree traversal is costly in the adjacency model. Finding all the children and grandchildren of a parent may be quite complex.

Shortcomings of materialized path

Materialized path requires you to build this lineage information at some point in time. If you have a million records for which you need to build the materialized path, then I suggest you start now, because no one knows when it will end. If someone knows of an efficient way of doing this, please let me know. If you get past this stage, there is still the issue of updating the data to handle moves and deletions.

Static and Dynamic Data

The choice we make is mostly driven by how many changes we can expect. If we are never going to modify the data, then materialized path, or any other approach which stores the lineage information alongside each row, is useful. But this is rarely the case.

Some vendor specific help

The folks at Microsoft and Oracle seem to have seen this issue and suggest the below techniques.

Sql Server

1. Common table expressions: Popularly known as CTEs, these are a way to run recursive queries on a self-referential table (see the example after this list).

2. HierarchyID: This is a datatype that is available in SqlServer 2008. It uses materialized path.
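As an illustration of the recursive CTE approach from point 1, a query to pull out every descendant of a given person could look like this (assuming the adjacency style Person table sketched earlier):

WITH Descendants AS (
    -- anchor: the person whose subtree we want
    SELECT Id, ParentId, Name
    FROM Person
    WHERE Id = 1
    UNION ALL
    -- recursive step: join the children of everything found so far
    SELECT p.Id, p.ParentId, p.Name
    FROM Person p
    INNER JOIN Descendants d ON p.ParentId = d.Id
)
SELECT * FROM Descendants;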

Oracle

1. START WITH and CONNECT BY: This is similar to the above method. It works on a self-referential table.
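A rough Oracle equivalent of the recursive query above (again assuming the Person table sketched earlier):

SELECT Id, ParentId, Name, LEVEL
FROM Person
START WITH Id = 1
CONNECT BY PRIOR Id = ParentId;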

Object modeling trees

Imagine a scenario where you need to model a huge family. I guess we start by having a Person class. Each person has 0 or more children, and children are nothing but a collection of Persons. Mapping this to the data in the database is a pain.
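Something along these lines (a bare bones sketch in Java; the field and method names are illustrative):

import java.util.ArrayList;
import java.util.List;

public class Person {
    private final String name;
    // children are just more Persons
    private final List<Person> children = new ArrayList<Person>();

    public Person(String name) {
        this.name = name;
    }

    public void addChild(Person child) {
        children.add(child);
    }

    public List<Person> getChildren() {
        return children;
    }

    public String getName() {
        return name;
    }
}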

1. Lazy loading: Most probably you will have to lazy load the children as and when you need them. Otherwise you may have to wait a generation to get the complete tree loaded.

2. If we want to implement things like deletion or reassignment, saving the data back to the database will not be easy.

Better ways to store hierarchical data

Hierarchies are graphs, so it is better to use a database like Neo4j, which has been a very popular graph DB.





Coroutines – back to basics

27 12 2009

Ruby 1.9 Fibers have got me reading about Coroutines.
Thought I should put all my understanding somewhere, as I read and understand coroutines in more depth.

Most of the content in this post is just an aggregation from various sources.

Coroutines are program components that allow multiple entry points and can return any number of times. Coroutines belong to a category of programming constructs called Continuations.

All programming languages have one way or another to handle control flow. Within a control flow there is an associated state – information like the values of local variables. The call stack is one of the most popular ways to store this information. Every method call gets its own stack frame, and this frame is discarded once the method returns, either normally or through an exceptional flow.

In a coroutine this is not the case. We can suspend and resume execution without losing the stack.

Types of Coroutines:

1. Symmetric Coroutines: A symmetric coroutine has a single transfer operation that can pass control directly to any other coroutine. Example: Lua

2. Asymmetric Coroutines: These are also called semi-coroutines. The choice for the transfer of control is limited: asymmetric coroutines can only transfer control back to their caller. Example: Ruby 1.9 Fibers

Examples:

producer consumer


#!/usr/bin/ruby1.9.1

def producer
  Fiber.new do
    value = 0
    loop do
      Fiber.yield value
      value += 1
    end
  end
end

def consumer(source)
  Fiber.new do
    for x in 1..9 do
      value = source.resume
      puts value
    end
  end
end

consumer(producer).resume

Fibonacci


#!/usr/bin/ruby1.9.1

fib = Fiber.new do
  x, y = 0, 1
  loop do
    Fiber.yield y
    x, y = y, x + y
  end
end

20.times { puts fib.resume }

Why are coroutines important?

The main reason why coroutines are back in the limelight is concurrency. In my humble opinion, concurrency is reviving many well known but forgotten programming concepts.

To take the example of ruby, most of us are aware of the Global Interpreter Lock. Threading in ruby is totally useless because ultimately all threads run as part of the same OS thread, which means there is no true concurrency. Fibers in ruby are very similar to threads but are lightweight. They can be scheduled, suspended and resumed as per the programmer's choice.

Coroutines can be used to construct the actor model of concurrency. This is the same model used by Erlang. Revactor is a very nice implementation of the actor model in ruby.

I will add code here when time permits.





I hate ORM

9 12 2009

The title is not meant to start a war over the concept of ORM. I appreciate the effort that has gone into mappers. But let's take a look at why I hate ORMs. (Don't hate me because I hate ORM 🙂 )

Prelude

I am beginning to wonder how many applications that we build really need a relational database.

Some terms become synonymous with their usage. For instance, Xerox has become synonymous with copiers.

Relational databases have almost become synonymous with databases. As a developer, or anyone involved in system design, it is very important to know the options that are available to store data. The choice of persistence technology governs application scaling and performance in a very big way.

Now, Why do I hate ORM

ORMs hide the inconvenience that comes with using RDBMS with object oriented code.

When I learned relational modeling, I really liked it. I still do like making relational models. But think about how long relational databases have been in existence: they were around well before the widespread usage of object oriented programming. Back then code was procedural. The relationships between data had to exist somewhere, and it made sense to have them in the persistent store. Querying became easier.

But it was rather hard to switch older persistent stores to other technologies when we moved to object oriented code. The reasons were many, for example: availability of skilled database developers, strong trust in RDBMS, good vendor support etc. But the move towards newer languages like C++, java and C# was inevitable. ORMs were a win-win solution to this problem.

Before ORM, all of us were used to writing a mapping layer ourselves. ORM was such a relief when it hit the market. It set us free after years of wrangling with ugly mappers. But in the revelry we seem to have forgotten that it was the database that needed a second look and not the codebase.

Now we have duplication of relationships in the data as well as in the code. It is surprising that this duplication of relationships has not struck us as a problem.

Even frameworks like rails give us an impression that the standard way to build a web application is to use an RDBMS as a backend.

I simply cannot grasp the amount of effort we put into mapping objects to schemas. Another annoying issue is having to complete the database design before starting development. Using Hibernate or Active Record on top of an existing schema is nothing less than tying oneself up in knots.

There is no point in great object oriented code if the system design is not appropriate. It is my humble opinion that ORM should not be used as an excuse to choose relational databases over other options. As with anything, use with discretion.

Let me know what you think.





Consume REST webservices in java using rapa

24 01 2009

I have been reading about REST webservices for quite a while now. Rails still seems to have the best support for consuming REST webservices (ActiveResource). It is magical the way ActiveResource works. Even if that magical, dynamic behaviour may not be completely possible in java, it would be helpful to have basic support for accessing REST webservices.

A few options which immediately pop up on a search are restlet, apache cxf and jersey (the reference implementation of jsr 311). But they are not as easy to use as ActiveResource and not very object oriented. The motivation behind rapa is to fill this gap. In this blog I will take you through creating a simple REST webservice with rails and then using rapa to consume it in an object oriented way.

Rapa uses the tried and tested Jakarta Commons HttpClient and JAXB. It helps you consume REST webservices easily by taking care of the grunt work (making the connection to the webservice, transporting data etc). All that said, it's time for some code.

Prerequisites for this example:

If you are in a hurry, you can download the sample application code here.

If you want to get your hands dirty building the rails rest webservice and the rapa client yourself, below are the steps.
Firstly create the REST webservice in rails (which we will later consume in java).

rails rest --database=mysql
cd rest

ruby script/generate model customer id:number name:string

ruby script/generate scaffold customer

Then edit the database.yml file according to your database settings.

rake db:create
rake db:migrate

Gentlemen, start your server:

ruby script/server

Now access your rest webservice by pasting the below url in a browser:
http://localhost:3000/customers.xml

<customers type="array">
</customers>

Since there is no data yet you will see only so much.
Now let us consume this webservice with rapa.
Download rapa. You will also need the below supporting jars.

Create a project with your favourite ide. Create a POJO class called Customer (or any name you think appropriate). This class has to implement the org.rest.rapa.resource.Resource interface.

/**
 * Purpose: Value object that represents the RestAPI
 */

import org.rest.rapa.resource.Resource;

public class Customer implements Resource {
	private String name;
	private int id;

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public int getId() {
		return id;
	}

	public void setId(int id) {
		this.id = id;
	}

}

Instantiate the org.rest.rapa.RestClientWrapper class as shown below.

RestClientWrapper restClientWrapper = new RestClientWrapper(
"http://localhost:3000/customers", "", "", "localhost", 9000);

The first parameter is the base url of the REST webservice. The second and third parameters are the username and password (in this case they are empty because our sample rails application does not require authentication). The fourth parameter is the host name and the fifth parameter is the port number.

Now we are ready to use the webservice to create, read, update and delete with rapa.

Create

Customer customer = new Customer();
customer.setName("Hari");
restClientWrapper.save(customer);

Read

customer = (Customer) restClientWrapper.getById(1, Customer.class);
System.out.println(customer.getName());

Update

customer = new Customer();
customer.setName("rapa");
customer.setId(1);
restClientWrapper.update(customer);

Delete

customer = new Customer();
customer.setId(1);
restClientWrapper.delete(customer);

Rapa throws a checked exception called org.rest.rapa.RestClientException if an operation is not successful.
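Since it is a checked exception, each of the calls above would typically be wrapped in a try/catch, along these lines:

try {
	restClientWrapper.save(customer);
} catch (RestClientException e) {
	// the operation failed; log it or surface the error to the caller
	System.err.println("Could not save customer: " + e.getMessage());
}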

All the above operations in the example can be verified by looking at the rails log. For example, the create method results in the below log output from the rails server.

Processing CustomersController#create (for 127.0.0.1 at 2009-01-24 23:20:16) [POST]
Session ID: fda551698b0faca8bed707f7148fe3de
Parameters: {"format"=>"xml", "action"=>"create", "controller"=>"customers", "customer"=>{"name"=>"Hari", "id"=>"0"}}
SQL (0.000000)   SET NAMES 'utf8'
SQL (0.000000)   SET SQL_AUTO_IS_NULL=0
Customer Columns (0.031000)   SHOW FIELDS FROM `customers`
SQL (0.000000)   BEGIN
Customer Create (0.000000)   INSERT INTO `customers` (`name`, `updated_at`, `created_at`) VALUES('Hari', '2009-01-24 17:50:16', '2009-01-24 17:50:16')
SQL (0.031000)   COMMIT
Completed in 0.20300 (4 reqs/sec) | Rendering: 0.14100 (69%) | DB: 0.06200 (30%) | 201 Created [http://localhost/customers.xml]

Enjoy consuming REST webservices in java.

This project is still in very initial stages. Please send me your feedback.

Update: Please take a look at the latest improvements at github. The documentation has been updated as well.