Friday May 18, 2012
 

Speeding up EC2 work by using AWS tools and scripts to bypass the AWS management console

Believe me managing EC2 instances is not as simple or magical as marketers would like for you to believe. The cloud gets complicated when it gets bigger. EC2 really only enables a person to ignore power, network layout (which is bad), and getting more servers that end up costing more then actual servers fast.

Things that EC2 is missing to make life easier for the developer:

Ability to update all servers with packages and code. Natively they do not support the ability to push files or install new software packages to server groups. Thus install cluster-it and puppet and write your own deploy program.

Server names and the EC2 AWS management console do not match. Everything is referenced by instance ids. The interface does not allow one to launch many instances in a named pattern so you have to go back and sync up the instance with the internal named used in the app. Syncing up the names is very tedious and time consuming process through the console.

Assigning EBS volumes is a pain in the ass as well. Essentially you need to assign them one by one, where each add takes more then 2 mins todo this is not good time spent. For instance it took me roughly 20 mins to attach 8 63GB EBS volumes to a single server.

Amazon is perfectly aware of these limitations-and do not hide from it. They are going after something bigger. They are providing a platform. There is an API for everything you need to make your work easier. Now the pain is to learn the API and use it in your favor. There are companies that built a business on making a better interface for the AWS console but getting it done yourself is cheaper.

My personal mantra is to automate things that I have to do more then once. Anytime that I deploy new instances, I take the private IP add it to DNS, make an API call through ec2-describe-instances, find the instance id and update the name through ec2-create-tags. This solves the problem that I have with mapping instance ids to my internal name which the app uses. For instance:

#!/usr/bin/perl -w ##use strict;use Data::Dumper;open(HOSTS, "</etc/hosts") or die($!);

my $hosts = {};while(<HOSTS>){    my ($ip, $hostname, undef) = split(/\s+/, $_);    $hosts->{$ip} = $hostname;}

open(FH, "/opt/aws/bin/ec2-describe-instances -C cert.pem -K x509.pem --region us-west-1|") or die($!);while(<fh>){    if ($_ =~ /^INSTANCE\t(.*)/){        my (@fields) = split(/\s+/, $1);        # 0 - instance        # 1 - ami        # 2 - public dns        # 3 - private dns        # 4 - state        # 5 - ??        # 6 - ??        # 7 - instance type        # 8 - date created        # 9 - DC        # 10 - ??        # 11 - monitoring state        # 12 - public ip        # 13 - private ip        # 14 - ebs        # 15 - ??

     if ($fields[4] eq 'running'){            my $role;            my $hostname = $hosts->{$fields[13]};

            if (!$hostname) {                print "$fields[13] is not in the hosts file skipping..\n";                next;            }            if ($hostname =~ /^job/){                $role = 'gearman-worker';            }            if ($hostname =~ /^gearman/){                $role = 'gearman-queue';            }            if ($hostname =~ /^www/){                $role = 'webserver';            }            if ($hostname =~ /^memc/){                $role = 'memcache';            }            if ($hostname =~ /^db/){                $role = 'database';            }            if ($hostname =~ /^dbshard/) {                $role = 'database-shard';            }            if (!$role){                print "$hostname does not have a role\n";                $role = 'other';            }            system("./aws/bin/ec2-create-tags -C cert.pem -K x509.pem --region us-west-1 ".$fields[0] ." --tag Name=$hostname --tag Role=$role");        }    }}

Now to attach disks to an instance, that I am upgrading or re-purposing I wrote a quick script that describes the input instance after translating from my internal name to instance id. Calculates the size of each disk and attaches said disks. For instance:

#!/usr/bin/perl -w##use strict;use Data::Dumper;use POSIX qw(ceil);

print "Enter Hostname: ";my $hostname = <>;chomp($hostname);my $cmd = './aws/bin/ec2-describe-instances -C cert.pem -K x509.pem --region us-west-1 --filter="tag-key=Name" --filter="tag-value=' . $hostname . '"';

open(FH, "$cmd|") or die ("Awesome death: $!\n");

my $instance = "";my $lastDisk = "";my $diskCount = "";while(<FH>){

if($_ =~ /^INSTANCE\t(.*)/){        my (@fields) = split(/\s+/, $1);        $instance = $fields[0];        print "Instance=$instance\n";    }

    if($_ =~ /^BLOCKDEVICE\t(.*)/){        my (@fields) = split(/\s+/, $1);        $diskCount++;        $lastDisk = $fields[0];        print "$lastDisk\n";    }}

print "How many disks you would like to add: ";my $totalAddDisks = <>;chomp($totalAddDisks);

print "You picked $totalAddDisks\n";print "What is the total size of the Raid0 Array in GB: ";my $totalSize = <>;chomp($totalSize);

print "You picked $totalSize GB\n";my $sizeperdisk = ceil($totalSize/$totalAddDisks);print "The size per disk: $sizeperdisk\n";

$lastDisk =~ /sd(\S)/;my $lastDeviceLetter = $1;my @devicesavail = ($lastDeviceLetter .. 'z');

for(my $i = 1; $i <= $totalAddDisks; $i++){    $cmd = "./aws/bin/ec2-create-volume --size $sizeperdisk --region us-west-1 --availability-zone us-west-1c -C cert.pem  -K x509.pem";

    my $ret = `$cmd`;    my (@output) = split(/\s+/, $ret);

    $cmd = "./aws/bin/ec2-attach-volume --region us-west-1 -C cert.pem -K x509.pem $output[1] --instance $instance --device /dev/sd$devicesavail[$i]";    $ret = `$cmd`;

    if ($ret =~ /attaching/){        print "All good do the next one\n";    } else {        die("Did not work\n");    }}

These are rough and dirty scripts that get the job done for my environment. The end goal when given time is to turn these scripts into a package talking over httpd that makes life easier when working in EC2. Using these two script have reduced the management time from 1 hour per server upgrade to a few minutes.

Post to Twitter Tweet This Post

mySQL 5.6 looks very promising

I have been doing some research lately on various mySQL related features and found myself distracted by mySQL 5.6. If everything that I am reading is correct, this should be the best mySQL version yet.

mySQL 5.6 will have multi thread replication, making time delayed slaves-an actual feature and not a byproduct of replicating high write volumes. Also, crash safe replication! When a master crashed and corrupted the binary log this was a pain in the ass to fix. In the past I had to write scripts to walk the primary key and do a checksum on each returned table chunk and pick which row was correct and which one was not.

The performance optimization for innodb that addresses some stalls is as exciting as multiple SQL threads for replication. One major change is in the stat layer, which was throttling throughput for in memory workloads at high thread concurrency. You can read the details here.

The optimizer is getting an overhaul that has been needed for some time. This is exciting but by habit I still will probably just use (FORCE|USE) INDEX. Here is an example of using FORCE INDEX to get better performance.

This stat shows innodb hitting the disk. A single query was flopping between two indexes producing 33% MORE disk reads then necessary. Making a single line change on that query gave me more capacity in EC2. (Don’t hit the disk in EC2). In theory, the changes to the optimizer will reduce these flip-flops of indexes – but I doubt it will be as good as a human picking the index for their query.

Finally more and more stats from INFORMATION_SCHEMA which will probably create a flurry of bugs on the 1st couple days of release as the database hits large installs, and unexpected workloads exposes some stuff in the overall code – as everyone will start using information more to show dips in throughput.

In conclusion, 5.6 looks awesome and I can’t wait for a Percona Port.

Post to Twitter Tweet This Post

Sizing EC2 Servers to get more Bang for the Buck

To launch an app there was a tried and true process. The process in the past was to rent a server in a managed hosting facility. Get a few more as you grow, then build a model showing that it would be cheaper to get your own cage and finance servers. If the app continues to grow, build a model showing it would cost cheaper and allow for more rapid growth if you built your own datacenter after multiple requests for more power, space and moving to larger spaces with more power.  Now we have a new Step. If Step 1 is managed hosting then Step 0 is putting your application in the cloud i.e. Amazon’s EC2.
When an app grows cost grows at a multiple in EC2. To offset these costs, take periodic looks at your architecture to see if there is anything you can do to reduce overall cost. Let’s look at a tier that is easy to scale. Let’s look at application memory caching on Memcache as an example to illustrate reducing cost.
To scale network caches, add more memory. Make sure that the throughput of data going through the system doesn’t exceed your instance network bandwidth capacity and generally, you are good. (There are CPU concerns at high concurrency and pulling many bits at a huge rate). Simplifying things the cost to scale Memcache is really the cost per GB of memory in EC2.
Sizing your Memcache pool is based on a number of factors. What is being cached? How is the cache used? How many distinct items and how large is each item on average. For a dynamic app that caches database rows let’s base the cache on reducing repetitive queries to the database for users hitting the site. So the size should be based on how many daily active users the application has. The model is a function of daily active users (DAU) cacheable data size.  If the database is 200GB from 1 million users and 10K users use the site per day, cache 10K users worth of data instead of 200GB.  
Northern California DC prices us-west1 – has the best uptime and costs the most:
m1.large Amazon’s Large Instance Server has 7.5 GB of memory and costs $0.36 per hour per server. There are 744 hours in a 31-day month so this m1.large costs roughly $268 per month.
m1.xlarge is 15GB but has eight Compute Cores – you don’t need eight so lets skip that and look at the High memory plan.
m2.xlarge has 17.1 GB of memory and 6.5 Compute Cores and costs roughly $376 per month.
m2.2xlarge has  34.2 GB of memory and costs  $752
m2.4xlarge has  68.4 GB of memory and costs $1505
So the impulse might be lets get a bunch of m1.large since they are the cheapest at $268, but in actuality the price is not the cheapest for the scale needs.
m1.large costs roughly 5 cents per hour per GB while the high memory family costs 3 cents per hour per GB. Since the other scale condition is to not saturate your network bandwidth, having more points of access solves this with the additional benefit of loosing smaller sections of cache when a server goes down. (This is handwaving assuming the instances are not on the same server/rackswitch etc).
Therefore, the best bang for the buck is m2.xlarge (m2 family) at 3 cents per GB per hour for a monthly cost of $376. This is only 40% more then m1.large cost per month for 2.26 times the memory of m1.large. Now if the application doesn’t require 2.26 times the memory and its not being utilized then its not worth spending more money.

Post to Twitter Tweet This Post

mySQL Shards, Gearman, rabbitMQ, NodeJS, JSONP Push for Fault Tollerant Realtime Notification Feed

Over the past few weeks, I have been tweaking my feed system to be truly real-time. The hypothesis is, if end users are pushed content without forcing them to refresh a page they will interact on the content-which in turn turns a consumer into a producer of content causing virality within subnetworks of the Graph. The Graph is all the connections on the site while subnetworks are personal connections i.e. your friends, their friends etc. Additionally to note there are many more consumers then producers.

My hypothesis was wrong. Or is it?

I couldn’t accept the conclusions drawn from the usage data. It just makes sense that making something easier for a user should increase the chances for interaction. Thus I went digging and found a problem with my implementation of rabbitMQ. Before I go into the problem let’s talk about the set up.

A user is connected to the wwws through a load balancer. The www servers serve dynamic content to the end user. When a user produces content, that content is committed to a sharded database, based on their entityId. EntityIds are globally unique and there is a lookup table all in memory that says entityId belongs to shard X-and shard X is available for this user. After the data is committed a message is sent to Gearman, where Gearman workers will pump messages into RabbitMQ.

Now you may be asking yourself, why in the hell are you writing to a queue to write to another queue? (I too do not like this) Sometimes if to many messages are produced for rabbitMQ, rabbitMQ will block the producer. If the producer was the commit process (WWW Tier) the end user would have to wait until rabbitMQ unblocks to see the response and/or all the threads will block crashing the system and producing downtime. Gearman removes this problem.

The same user is also long polling a Node Server on a different subdomain (which requires a JSONP Long polling method). This server sits in front of rabbitMQ. NodeJS holds a persistent connection to rabbitMQ, keeps a single queue subscribe per user online, and missed messages when the user is not connected are resent if the user re-connects to Node. If the user does not reconnect then Node will remove the user from the list and destroy their queue. Only messages are sent to online users from PHP by asking node is this user online.

The end result to a user is if a person who is online comments, likes, shares their content the feed should be updated for all the people who can see the event and are also online, the content owner should be notified and it should fade in nicely.

In dev this works great, in production I would notice weird times that messages would be missed, causing me to refresh the page. This was the problem; I was missing messages. So what is going on? To understand the issue I use a method which I rule out the known and all that is left is the unknown. Most of the time assumptions are the cause of the problem.

Tracing the flow, DB commits are 100% correct. Publishing return codes show 100% correct behavior based on the return code, yet rabbitMQ’s rabbitmqadmin.py process was not displaying what the code was indicating.

The flow is as follows

#!/bin/bash

#

# declare the exchange as in-memory only and do not delete

#

./rabbitmqadmin.py declare exchange name=test_exchange type=direct durable=false auto_delete=false

#

# list exchanges

#

./rabbitmqadmin.py list exchanges

#

# declare a queue with the same settings as the exchange

#

./rabbitmqadmin.py declare queue name=test_queue auto_delete=false durable=false

#

# list the queue

# 

./rabbitmqadmin.py list queues

#

# established a route from the exchange to the queue by binding

# 

./rabbitmqadmin.py declare binding source=test_exchange destination_type=queue destination=test_queue routing_key=myroute

#

# list the route

#

./rabbitmqadmin.py list bindings

#

# publish the messages 3 should be waiting for consumption

# 

./rabbitmqadmin.py publish exchange=test_exchange routing_key=myroute payload="this is a messages"

./rabbitmqadmin.py publish exchange=test_exchange routing_key=myroute payload="this is a messages"

./rabbitmqadmin.py publish exchange=test_exchange routing_key=myroute payload="this is a messages"

sleep 5;

#

# show the queue message numbers

#

./rabbitmqadmin.py list queues 

This is basically the flow that my PHP layer  uses. This shell script is my control case. It works and does exactly what I want. If a consumer is not present the message should queue. I found that PHP does not due this or even send the message at times.

 $ret = $this->channel->publish(json_encode($data), $this->routingKey, AMQP_IMMEDIATE, array('Content-type' => 'text/json', 'delivery_mode' => 1, 'expiration' => self::DEFAULT_MESSAGE_TIMEOUT));

The AMQP Pecl lib is far from mature. First expiration does not work from my tests. Second AMQP_IMMEDIATE causes memory leaks and inconsistent results. This was found out after turning off all the switches to duplicate my control, then searching google about AMQP_IMMEDIATE problems.

Below shows this Fix. This stat that I am monitoring indicates people interacting. Notifications are being shown to the end user automatically if the user is online.

Still something was not correct.  Good thing I have ganglia. Take a look at these stats. Notice that messages unack.  It’s huge. This means to me that messages are being missed and not being sent to users.

NodeJS is an event based system. Things happen asynchronously which state bugs are a plenty under load. I found such a state bug which caused an “object” from being initialized before the connection was ready.  Once fixed see the smooth growth in the first image (Node Fix).

Now for some stats. One Linux-AMI Amazon EC2-west c1.xlarge runs RabbitMQ, and NodeJS.  A peak of 9876 users where connected to Node, sending thousands of messages a second, with a 5 min load average of 1.62 where rabbit uses the most CPU/Memory and node uses very little CPU and memory. All and all the shards are handling 10s of thousand of writes a second, gearman is handling 10s of thousands of job insertions of second. Rabbit is handling thousands of messages a second, and for the end user within 5-30 seconds all their friends see their activity.

In conclusion is my hypothesis correct? Time will tell especially now that everything is working smoothly.

PS: Things I like to fix are message ordering and getting rid of Gearman as a layer.

Post to Twitter Tweet This Post

mySQL Shards, Gearman, rabbitMQ, NodeJS, JSONP Push for Fault Tollerant Realtime Notification Feed

Over the past few weeks, I have been tweaking my feed system to be truly real-time. The hypothesis is, if end users are pushed content without forcing them to refresh a page they will interact on the content-which in turn turns a consumer into a producer of content causing virality within subnetworks of the Graph. The Graph is all the connections on the site while subnetworks are personal connections i.e. your friends, their friends etc. Additionally to note there are many more consumers then producers.

My hypothesis was wrong. Or is it?

I couldn’t accept the conclusions drawn from the usage data. It just makes sense that making something easier for a user should increase the chances for interaction. Thus I went digging and found a problem with my implementation of rabbitMQ. Before I go into the problem let’s talk about the set up.

A user is connected to the wwws through a load balancer. The www servers serve dynamic content to the end user. When a user produces content, that content is committed to a sharded database, based on their entityId. EntityIds are globally unique and there is a lookup table all in memory that says entityId belongs to shard X-and shard X is available for this user. After the data is committed a message is sent to Gearman, where Gearman workers will pump messages into RabbitMQ.

Now you may be asking yourself, why in the hell are you writing to a queue to write to another queue? (I too do not like this) Sometimes if to many messages are produced for rabbitMQ, rabbitMQ will block the producer. If the producer was the commit process (WWW Tier) the end user would have to wait until rabbitMQ unblocks to see the response and/or all the threads will block crashing the system and producing downtime. Gearman removes this problem.

The same user is also long polling a Node Server on a different subdomain (which requires a JSONP Long polling method). This server sits in front of rabbitMQ. NodeJS holds a persistent connection to rabbitMQ, keeps a single queue subscribe per user online, and missed messages when the user is not connected are resent if the user re-connects to Node. If the user does not reconnect then Node will remove the user from the list and destroy their queue. Only messages are sent to online users from PHP by asking node is this user online.

The end result to a user is if a person who is online comments, likes, shares their content the feed should be updated for all the people who can see the event and are also online, the content owner should be notified and it should fade in nicely.

In dev this works great, in production I would notice weird times that messages would be missed, causing me to refresh the page. This was the problem; I was missing messages. So what is going on? To understand the issue I use a method which I rule out the known and all that is left is the unknown. Most of the time assumptions are the cause of the problem.

Tracing the flow, DB commits are 100% correct. Publishing return codes show 100% correct behavior based on the return code, yet rabbitMQ’s rabbitmqadmin.py process was not displaying what the code was indicating.

The flow is as follows

#!/bin/bash

#

# declare the exchange as in-memory only and do not delete

#

./rabbitmqadmin.py declare exchange name=test_exchange type=direct durable=false auto_delete=false

#

# list exchanges

#

./rabbitmqadmin.py list exchanges

#

# declare a queue with the same settings as the exchange

#

./rabbitmqadmin.py declare queue name=test_queue auto_delete=false durable=false

#

# list the queue

# 

./rabbitmqadmin.py list queues

#

# established a route from the exchange to the queue by binding

# 

./rabbitmqadmin.py declare binding source=test_exchange destination_type=queue destination=test_queue routing_key=myroute

#

# list the route

#

./rabbitmqadmin.py list bindings

#

# publish the messages 3 should be waiting for consumption

# 

./rabbitmqadmin.py publish exchange=test_exchange routing_key=myroute payload="this is a messages"

./rabbitmqadmin.py publish exchange=test_exchange routing_key=myroute payload="this is a messages"

./rabbitmqadmin.py publish exchange=test_exchange routing_key=myroute payload="this is a messages"

sleep 5;

#

# show the queue message numbers

#

./rabbitmqadmin.py list queues 

This is basically the flow that my PHP layer  uses. This shell script is my control case. It works and does exactly what I want. If a consumer is not present the message should queue. I found that PHP does not due this or even send the message at times.

 $ret = $this->channel->publish(json_encode($data), $this->routingKey, AMQP_IMMEDIATE, array('Content-type' => 'text/json', 'delivery_mode' => 1, 'expiration' => self::DEFAULT_MESSAGE_TIMEOUT));

The AMQP Pecl lib is far from mature. First expiration does not work from my tests. Second AMQP_IMMEDIATE causes memory leaks and inconsistent results. This was found out after turning off all the switches to duplicate my control, then searching google about AMQP_IMMEDIATE problems.

Below shows this Fix. This stat that I am monitoring indicates people interacting. Notifications are being shown to the end user automatically if the user is online.

Still something was not correct.  Good thing I have ganglia. Take a look at these stats. Notice that messages unack.  It’s huge. This means to me that messages are being missed and not being sent to users.

NodeJS is an event based system. Things happen asynchronously which state bugs are a plenty under load. I found such a state bug which caused an “object” from being initialized before the connection was ready.  Once fixed see the smooth growth in the first image (Node Fix).

Now for some stats. One Linux-AMI Amazon EC2-west c1.xlarge runs RabbitMQ, and NodeJS.  A peak of 9876 users where connected to Node, sending thousands of messages a second, with a 5 min load average of 1.62 where rabbit uses the most CPU/Memory and node uses very little CPU and memory. All and all the shards are handling 10s of thousand of writes a second, gearman is handling 10s of thousands of job insertions of second. Rabbit is handling thousands of messages a second, and for the end user within 5-30 seconds all their friends see their activity.

In conclusion is my hypothesis correct? Time will tell especially now that everything is working smoothly.

PS: Things I like to fix are message ordering and getting rid of Gearman as a layer.

Post to Twitter Tweet This Post

Asyncronous Shard Queries in PHP using mysqlnd enabling the feed load 10 times faster.

A few years ago I wrote about Asynchronous Shard Queries verses Synchronous Shard Queries, and in this post I talked about having to write a server to handle this for me in Java. Now I do it in PHP and got great results that are posted below.

Building the Feed was taking 100ms up to 40 seconds on the initial load, if the feed is out of cache. This is not acceptable for me as an engineer or the users I serve. Although the 40 seconds was rare it still is wrong.

The problem is as I added more shards to handle our data growth, the feed got proportionally slower. To build the feed for large users I would have to hit each server or at the very least a large percentage of them. Adding capacity made things slower in my Synchronous World.

My options to fix this issue where, do what I did in the past which, is boring, or try something new. In the past I wrote a server using Java with a Jetty Core that took a request and sent a merged response to get around php not having the ability to be parallel in a web context. It worked flawlessly but if that server went down users would not be able to see their info. Really its just another moving piece, another thing for me to monitor and at my company there is just 4 engineers to handle 3 million (and growing) DAU.

The new option was to use a mysqlnd option that enables asynchronous queries. Since I am running PHP 5.3.10 its an option that is available as a shared mod. and works with PDO except for this feature.

PDO is like Perl’s DBI (but better) its an abstract that enables you to switch SQL database servers without having to change your method calls. The problem is PDO doesn’t have the concept of asynchronous queries but this is really not an issue. I suggest for you to always write wrappers around used API’s so you can switch the underlying (vendor) api without having to change multiple places in your code. I did such thing to PDO we call it Shard which is a wrapper around the PDO object. Then through polymorphism I am able to change query, connect, fetch equivalents to support this mysqlnd feature.

Below is the code

ShardDB::getAsyncShardInstance($pool)->setSide($userId)->query($query)->fetchAll();

ShardDB is a static class that has a method getAsyncShardInstance for a given shard pool. Really the construct design pattern is a singleton which returns an async shard object that extends shard. (I love objects things just work and is clean).

setSide is optional to say keep the query on the side which the calling user uses unless that side is down.

query connects to the servers if not connected and executes the sql.

fetchAll polls for the responses and is equivalent to PDO’s fetchAll.

Here is the query code:

public function query($query, $args = array()) {

        foreach($this->connectionList as $shardId => $connection) {            Debugger::timer('async_query');            $connection->query($query, MYSQLI_ASYNC);            Debugger::log("ASYNC_QUERY", $query, 'async_query');        }        return $this;}

Then fetch all

public function fetchAll() {        $connectionCount = count($this->connectionList);        Debugger::timer('async_poll');        $results = array();        $processed = 0;        do {

            $links = $errors = $reject = array();            foreach ($this->connectionList as $i => $link) {                $links[] = $errors[] = $reject[] = $link;            }

            if (!mysqli_poll($links, $errors, $reject, $this->defaultTimeout)) {                $spinned++;                continue;            }

            foreach ($links as $link) {                if ($result = $link->reap_async_query()) {                    while ($row = $result->fetch_object()){                        $results[] = $row;                    }                    $result->free();                    $processed++;                }            }

        } while($processed++ < $connectionCount);        Debugger::log("ASYNC_QUERY", "Poll is done - Spinned $spinned times", 'async_poll');        return $results;} 

The affect of this code is illustrated below.

The blue line is greater than 2 seconds, which is bad then the line goes away the the change above. There is still some work needed to totally get rid of the greater than 2 second calls but what I got from this one change was worth the effort.

Thanks to mySQL and the internet super hero for building this feature. It was like it was built just for me!

Post to Twitter Tweet This Post

SpyMemcached Transcoder with PHP PDO Objects using ZLIB

My technology stack services more then 2 Million Daily Active users.  Its very basic. PHP talks to mySQL, Memcache, RabbitMQ, Gearman and Facebook.  Now that we have more Java specifically to support our SmartFox Server and other services, blurring the lines between what data is set in PHP and what data is read in Java is very necessary.

Java J-Connect makes reading mySQL Data as simple IMHO as PHP’s PDO. What is hard is reading PHP’s serialized format from PHP’s Memcache library.

In PHP there are two main C backed Libraries. There is Memcache the original PHP library which I happen to use, and Memcached which is the library I wanted to use but didn’t deploy since EC2 package system conflicted and cause issues (I fixed them but to late to deploy). Memcache stores data in PHP’s serialized format and compresses it via ZLIB, while Memcached can store data as PHP’s serialized format, JSON, Binary Serialized (which is rather awesome), JSON Array Notation and has a multitude of compressing formats none of which are pure ZLIB that I noticed.

Here is the problem. Spymemcached is a lib for talking to memcache but can’t unserialized PHP serialized format (or read it natively and return a string) and cannot Decompress ZLIB but can Decompress GZIP. Now a great speed up would be to use PHP’s serialized data set from PHP and share memcache resources from PHP and Java just like what is done for the mySQL resources.

What needs to be done? Well, build your own Transcoder for Spymemcached. Fortunately Spymemcached documented an interface to do just that.

What is needed. Implement the spymemcached Interface defined here. Use org.lorecraft.phparser to unserialize PHP data  defined here. Return the Object.

 Below is the code.

package com.schoolfeed.spymemcached;

import java.io.ByteArrayInputStream;import java.io.ByteArrayOutputStream;import java.io.IOException;import java.util.zip.InflaterInputStream;

import net.spy.memcached.CachedData;import net.spy.memcached.compat.CloseUtil;import net.spy.memcached.transcoders.BaseSerializingTranscoder;import net.spy.memcached.transcoders.Transcoder;import org.lorecraft.phparser.*;

public class PHPSerializedTranscoder extends BaseSerializingTranscoder implements Transcoder <Object> {

 static final int COMPRESSED=2; /**  * Get a serializing transcoder with the default max data size.  */ public PHPSerializedTranscoder() {  this(CachedData.MAX_SIZE); }

 /**  * Get a serializing transcoder that specifies the max data size.  */ public PHPSerializedTranscoder(int max) {  super(max); }

 /**  * decode the byte data from Memcache decompress it if necessary and return the Object  * @param CacheData - the byte data is turned into a object  * @returns Object   */ public Object decode(CachedData d){  byte[] data=d.getData();

  Object rv=null;  String ds="N;";

  if((d.getFlags() & COMPRESSED) != 0) {   getLogger().debug("Looks like d is compressed");   data=decompress(d.getData());  }

  ds=decodeString(data);

  getLogger().debug("DECODED: [" + ds + "] about to SerializedPhpParser");

  SerializedPhpParser sp = new SerializedPhpParser(ds);

  try {   rv = sp.parse();   getLogger().debug("Parse was cool!!");  } catch(Exception e){   getLogger().debug("Not a PHP Object? : " +  ds);   rv = ds;  }

  return rv; }

 /**  * PHP Memcache stores compress data in ZLIB format override the base class decompress method to handle ZLIB  *   * @param byte array - raw data from Memcache  * @returns byte array  */ protected byte[] decompress(byte[] in) {  ByteArrayOutputStream bos=null;  final int BUFFER = 2048;  if(in != null) {   ByteArrayInputStream bis=new ByteArrayInputStream(in);   bos=new ByteArrayOutputStream();   InflaterInputStream iis = null;   try {    iis = new InflaterInputStream(bis);

    byte[] buf=new byte[BUFFER];    int r=-1;    while((r=iis.read(buf, 0, BUFFER)) > 0) {     bos.write(buf, 0, r);    }   } catch (IOException e) {    getLogger().warn("Failed to decompress data", e);    bos = null;   } finally {    CloseUtil.close(iis);    CloseUtil.close(bis);    CloseUtil.close(bos);   }  }

  return bos == null ? null : bos.toByteArray(); }

 /**  * encode -- not implemented yet  *  */ public CachedData encode(Object o){  int flags = 0;  byte[] b=null;  return new CachedData(flags, b, getMaxSize()); }

 /**  * no need to async Decode let's do it realtime  */ public boolean asyncDecode(CachedData d) {  return false; }

}

This is a stop-gap solution until we make the transition to Memcached with JSON encoding. Then I can use Jackson-JSON - which is a fast JSON encoder/decoder for Java enabling a great portable message protocol between the two stacks and nearly any other language we might add to the system (like Python).

Post to Twitter Tweet This Post

Basic ETL with Gearman and MySQL in a few lines of PHP code

Gearman is awesome. If you do not know what it is, its a queue and load balancing system for an arbitrary number of workers which enables distributed computing across many nodes. Some of the same guys who worked on mySQL source worked on Gearman.

Feel free to search my blog on other gearman uses.

The Problem:

We store a lot of stats, make a lot of changes and we want to see the result of the stats in realtime. Our stat system is pretty slick. For each tag increment the application increments a count and group said tag by minute, hour, month with a hash tag numeric representation of the text for compact writes. This means 1 tag write produces 4 SQL statements. We track over 239211 distinct tags at around 10K Writes per second on a single mySQL instance on EC2 in a RAID-10 EBS xtra-large Config (although because EBS mirrors internally I can just raid-0 but I was too scared).

Once the mySQL instance hits disk (EBS) our throughput becomes very unstable, possibly slowing down the site.The solution was to defer these writes but how can I do it without building a logging system, aggregator, loader and having a bunch of moving parts? Really I want to only spend 10 mins on this problem and use existing monitoring code.  So my 10 min solution:

 3 mins to write the code
 6 mins testing
 55 seconds of patting myself on back
 5 seconds to deploy

Solution Detail:

Since Gearman workers connect to the GearmanD server’s Job QUEUE and loop for more Jobs. This means program stays in memory for the length of the process (until worker restart). This means I can transform the data in application memory. Since the program is persistently connected to the DB that means I can periodically load the data in chunks.

In stead of having 100s of possible concurrent connections doing writes I can control the writes based on the number of workers. Innodb is very fast and consistent at low levels of concurrency (less then 50).

Since I can drain the queue from GearmanD and transform the data locally I do not really need to worry about running out of memory on the queue server. The consumer is faster then the producer. I can combine 1000s of writes into a single write.

Let’s look at some code:

<?    require_once("includes/config.php");    require_once("includes/DB/EventTrackerDB.php");

    class EventTrackerETL {

        //        // keep track of distinct tags        //        public static $eventTable  = array();

        //        // the next flush        //        public static $nextWrite   = 0;

        //        // keep a stat of total writes        //        public static $totalEvents = 0;

        //        // total number of events        //        const MAX_NUMBER_OF_EVENTS = 30000;

        //        // number of seconds to pause        //        const FLUSH_INTERVAL = 20; // seconds

        /*         * transform all the tags into a sum of the counts entered         *  @params string $event    - the tag being incremented         *  @params int $count       - the supplied count many times its just 1         *  @params int $timeEntered - EPOC timestamp         *  @retuns void         */        public static function transform($event, $count, $timeEntered) {

            //            // initialize or increment a tag            //            if (isset(self::$eventTable[$event])){                self::$eventTable[$event] += $count;            } else {                self::$eventTable[$event] = $count;            }

            //            // flush if we hit the max number of events            //            if (sizeof(self::$eventTable) > self::MAX_NUMBER_OF_EVENTS){                return self::load();            }

            //            // flush if its our time            //            if (self::$nextWrite < time()){                return self::load();            }            return;        }

        /*         * flush the stored tags to the database         *  @returns void         */        protected static function load() {

            //            // write transformed events to the db            //            $thisRun = 0;            foreach(self::$eventTable as $event => $sum) {                EventTrackerDB::singleton()->updateEvent($event, $sum);                $thisRun++;                self::$totalEvents++;            }

            $msg = "EventTracker write complete $thisRun events this run and a total of " . self::$totalEvents . " events written so far";            Debugger::log("OT", $msg);

            //            // re-init            //            self::$nextWrite = time() + self::FLUSH_INTERVAL;            self::$eventTable = array();        }    }?>

In summary with gearman I am able to process 250K events in seconds. The queue never builds up and there is special code to handle kills (not SIGKILL).

Post to Twitter Tweet This Post

Error injection tests for InnoDB would be nice

I am trying to figure out why an InnoDB table was lost when a DDL statement failed. I think it was a RENAME TABLE statement. I have yet to find the root cause but I did find that InnoDB doesn’t report some errors when RENAME fails so the user thinks that the table was renamed, the FRM file is renamed, and the ibd file is not renamed. This is only a problem for files not in the InnoDB system tablespace so –innodb_file_per_table=1 must be used. This is bug 64144.

As I wrote in a previous blog post, it is time to add error injection tests to InnoDB.

Post to Twitter Tweet This Post

Big & small stuff in ASE 15.7, and a note for ‘Peter’

Last week I did my part in a 5-episode webcast series about Sybase ASE 15.7 – the latest and greatest incarnation of your favorite OLTP DBMS (and not just yours: ASE is also SAP’s preferred OLTP DBMS; see here for more).
In case you missed it: ASE 15.7 was released on September 28th, and can now be downloaded by anyone with a Sybase support contract. For everyone else, the Developer Edition and the Express Edition for ASE 15.7 are also available now.
Specifically for a guy named Peter, I can now finally shed some light on a tech question he brought up — so Peter, since I did not have your email address, keep reading…

ASE 15.7 is packed with new functionality. I cannot actually remember an ASE release with so many new features as ASE 15.7. I guess that’s the reason why the webcast comes in 5 installments.
My part of the webcast was about application development features. This means: enhancements to SQL functionality or syntax whose focus is to deliver better (or more) functionality – as opposed to better performance or shorter maintenance downtime, for example (which ASE 15.7 also has, but that’s the topic of subsequent webcasts). You can listen to the webcast recording here.

Let me just briefly recap the highlights of these application-developer-oriented features in ASE 15.7.
First, there have been enhancements to handling of text/image data (collectively known as LOB data). For as long as I can remember, customers have asked for the ability to declare SQL variables and parameters of the text or image datatype. In ASE 15.7, that is now possible. This can be particularly handy when doing XML processing on mature XML documents, which can easily exceed the maximum length of 16KB for a varchar variable.
Another LOB enhancement is known as “LOB locators”. This is best described as a “pointer” to a LOB variable. When a client application wants to operate on a LOB value (like truncating/overwriting/concatenating it, etc.), ASE 15.7 can send the LOB locator to the client rather than the full LOB value, which may be big. The client app then operates on the locator with new (and existing) T-SQL statements which are sent back to the ASE server where the operation on the LOB value is effectuated. This improves efficiency since no big LOB values need to be exchanged between client and server, but only the small (24 bytes) LOB locator values instead.
If this all sounds a bit cryptic, check out the webcast slides for more detail and examples.

Another enhancement in ASE 15.7 is the MERGE statement. I have blogged about this before, so see there for more details. Suffice to say here that MERGE is a more efficient way of inserting-or-updating a bunch of rows into another table than you could ever code yourself with separate insert an update steps.
This is the point where Peter comes in. At the end of the webcast one listener with that name asked a question which I was unable to answer at the time. His question was whether you can force an index in the MERGE statement in the same way as for other types of query. After consulting with the responsible engineering team, the answer has to be that, unfortunately, this is currently not supported; the same applies to other query plan forcings (I/O size etc) as well as to specifying an abstract query plan. Should you try this, you may find that it actually works fine. However, be warned: it may also just as well not work, since there is currently a number of known issues when using these clauses for MERGE. For that reason, we recommend you do not try to use any query plan forcings around the MERGE statement for now.

There is also a new feature that is perhaps less about app development and more about performance after all. In a nutshell, by enabling the config parameter ’streamlined dynamic SQL’ (disabled by default), a series of internal optimizations around query processing becomes active (not just for dynamic SQL, but for many other types of queries too). This should translate into better performance although it is -as always- hard to predict how much gain you may see in your system. So I’m not going to speculate. Instead, please let me know what you find. There is no reason not to use this optimization, so please give it a try…

Lastly, there are some features that qualify as ’small’ in comparison to the other topics discussed. These include syntax enhancements; for example, certain constructs are now allowed in subqueries that previously caused an error.
As it happens, my favorite new ASE feature falls actually in this category. If you want to convert the number 123 to a 6-character string, and pad the result with leading ‘0′ characters (i.e. ‘000123′) in pre-15.7 you had to use the following expression (figuring out how this works is left as an exercise to you, reader):

select right(replicate('0',6) + convert(varchar,123),6)

In ASE 15.7, you can now simply specify the padding character as a third argument to the str() function, so the following is now sufficient:

select str(123, 6, '0')

Any padding character can be specified (a ‘*’ would result in ‘***123′).
Admittedly, it did probably not require a major effort by Sybase’s most talented engineers to implement this enhancement. But I always really like such little features that make my life easier when coding SQL. Small can certainly be beautiful.

(Coming up in the next webcasts: some ASE new features and enhancements that are really big. Those are cool too. Stay tuned.)

Follow rob_verschoor on Twitter

Post to Twitter Tweet This Post