How to use Kafka Streams

In this blog post I’m going to write about how to use Kafka Streams for stateful stream processing. I’m going to assume that the reader knows the basic ABC’s (producers, consumers, brokers, topics) of Apache Kafka.

Problem Statement: We needed a system that would consume messages, perform fast, stateful, real-time processing of those messages, and then forward the processed messages downstream, with scalability, fault tolerance, high throughput, and millisecond processing latency.


Historically at Yesware, we had used RabbitMQ as our messaging system with good results. But at the time this requirement came up, the latest release of Apache Kafka (0.10) had just introduced a new feature called Kafka Streams. So our options were:

  1. Use RabbitMQ (which we were quite familiar with) for messaging and at the downstream consumer level use a fast in-memory data store (like Redis – which we also had used quite extensively) to do stateful processing, and forward the processed messages to the final destination
  2. Use Kafka for messaging (which we had used sparingly), and try our luck with this new thing called Kafka Streams which looked quite promising – mainly because it builds on the highly scalable, elastic, distributed, fault-tolerant capabilities integrated natively within Kafka.

We decided to choose Kafka for messaging, because we expected a large volume of message traffic per second and also because we wanted to expand our boundaries. Now, with Kafka as our messaging service, we still could have used a stream-processing framework outside of the Kafka ecosystem, but then we would have faced a lot of complexity (picture on the left), so choosing Kafka Streams was quite tempting indeed (picture on the right).

Stream processing with an external framework (left) vs. Kafka Streams (right)

So let’s quickly go over the basic concepts, before we dive into the code snippets.


Automated UI Testing for Native Windows Applications

Yesware has a robust culture of automated testing. We have no QA department; handing off manual testing to one would bloat the “testing” phase of our build, test, release cycle. Instead, engineers at Yesware write unit tests and UI tests along with any patch they want to merge into our master branch. This allows us to confidently and quickly add features, refactor code, pay down tech debt, and update underlying dependencies, relying on our suite of tests to point out the exact issue(s) in any specific set of changes. We avoid lengthy development schedules that are slow to complete, cumbersome to change or adapt, and fragile in the face of the unexpected.

Automated UI tests are finicky and can be expensive to write and maintain. The choice in tooling makes a huge difference in the tests’ efficiency, reliability, and maintainability. The tooling should hide general but nitty gritty details of UI automation while allowing the developer to extend and customize to meet their specific requirements.

Yesware has extensive experience writing and running automated UI tests for our web applications, and we have benefited greatly from the work of the open source community in creating, maintaining, and resolving issues for tools like Capybara, Selenium WebDriver, WebKit implementations, and many more. Something we didn’t have much experience with was automated testing for native Windows applications, or, in the case of Yesware for Outlook, automated testing for a plugin to Outlook.

While trying to choose technologies to adopt, we recognized that writing automated UI tests when we didn’t control the host application would pose a challenge, so any tool that came with documentation and sample UI automation for Microsoft Office Add-ins would have a large leg up in our evaluation. So while we became aware of TestStack/White, which builds on top of the UI Automation framework, we favored Microsoft’s offering: Coded UI. It had ample documentation and videos, and we found two sample UI automation projects covering how to use Coded UI for Office Add-ins.

On the surface, Coded UI looked very promising.

  • It was an official Microsoft offering
  • Visual Studio Premium already came with it bundled
  • Lots of documentation, including the aforementioned samples, videos, and MSDN documentation and walkthroughs
  • Promises to hide and abstract away the accessibility and automation layers
  • Mitigated the risk of targeting Office add-ins by demonstrating with working examples

However, Coded UI turned out to be a very difficult option.

The generated code was very verbose. For example, Coded UI generated 500 lines of code to

  • Launch Outlook
  • Open a message composer
  • Compose to “someone@example.com”, subject “hi”, body “hello”
  • Send the email


500 lines of code to compose and send an email

Such verbose code was not a problem on its own, but since the code was so unreliable for our use case, its verbosity made diagnosing and resolving issues a nightmare.

The accessibility properties in Outlook changed in very subtle ways when our code changed. The code verbosity made it a challenge both to pinpoint these subtle discrepancies and to code an elegant resolution. Even very small changes to our application, Yesware for Outlook, could require a subtle correction in our automated UI test that was difficult to identify and implement. Larger changes to our application could require very extensive changes to the UI test.

The Coded UI search engine was a flaky black box that would intermittently fail. These failures appeared maddeningly similar to subtly incorrect search criteria. However, we ruled out that the error was in our search criteria by successfully finding the target element in another instance of the Coded UI search engine with the exact same search criteria.

As a silver lining, since we needed to adjust Coded UI’s behavior at the Windows automation layer, we became more familiar with Windows automation. We were able to leverage this experience as we moved to White.

White is much more expressive and readable than Coded UI. The previous example where Coded UI generated over 500 lines of code could be written in 70 lines using White for the exact same functionality.


Running automated UI test written in White. Try out the demo.

As you can imagine, maintaining 70 lines of code is much easier than maintaining 500 lines.

As daunting as the UI automation may appear (try checking out the MSDN UI Automation overview), we have found that searching by the “class name” and by the “text” accessibility properties has been consistently sufficient to retrieve the desired UI elements. White did a good job hiding the many details of Windows accessibility that were none of our concern.

White doesn’t generate the search properties for you, so use the Windows Inspect tool to help you identify the values of the search properties of the elements you are trying to interact with.

As we’ve mentioned, we run our tests continuously, so we also have our continuous integration server run our UI tests on every candidate patch. Jake Ginnivan wrote a great guide on how to set up a TeamCity build agent on Azure to run automated UI tests. While some details may be outdated, the crux consists of the following broad strokes:

  • set up the build agent on an Azure VM
  • set up a persistent graphical UI session (necessary for Windows UI automation)
  • run the TeamCity build agent within that UI session

TestStack.White is not without its flaws. While it is a vast improvement over Coded UI for us, we’ve stumbled over a couple of parts of White. Their issues page is a pretty comprehensive list. Development isn’t particularly active, but the project has great potential and plenty of areas to improve. For example, Coded UI had means to catch playback issues and retry while White does not.

The UI automation tools for native Windows testing have lagged behind web-based tools. While there is a large ecosystem of tools to choose from to run automated UI tests against your web application, we found only two major contenders for automated UI testing against native Windows applications. We hope this article helps you make a more informed decision on how to UI test your Windows application. Writing and maintaining these UI tests have been a long road for our team, but it has gone a long way to catching UI bugs in our code before they are deployed and caught by our users.

Introducing YetiLogger

In this blog post I’m going to introduce the yeti_logger gem. This is Yesware’s shared logging mechanism for our Ruby code. You might be wondering why such a need exists. Ruby has pretty decent logging via the Logger class and Rails builds upon that. So why another layer? For us it came down to a few reasons:

  • abstraction
  • efficiency
  • format

Keep reading for some background on each of these reasons and a high level overview of how to use it, or feel free to jump straight to the GitHub repo (https://github.com/Yesware/yeti_logger) to check out the Readme.

More on the why?

As with all things software, we tend to change our minds on things here at Yesware. Logging is not necessarily one of them, but it’s something we’ve toyed with in the past. We wanted to be sure that all of the great logging we were adding to the product wouldn’t result in a refactoring nightmare if we switched from using Rails.logger to something else. Also, to make things more complicated, we have some non-Rails applications that use Logger directly, and a very few places that even use puts (most of our production apps run on Heroku, where our logs are sent to stdout anyway, so this isn’t as bad as it sounds). Having a nice abstraction layer over whichever logger we wanted to use seemed like a good idea.

Another aspect of abstraction is the idea that it should be very easy to log (and to stop logging). By making YetiLogger a mixin module, you get methods you can call directly on your class or class instance, such as log_info("hello"). By having these methods live on your class/module/instance, it makes it much easier to replace YetiLogger, or just bypass it by defining a method on your class/module/instance with the same signature of those from YetiLogger. For example to temporarily bypass all calls to YetiLogger#log_info, define a dummy method on your class that does what you like with it:

def log_info(message)
  # Code to stick message wherever you like, such as a database, queueing system, or nowhere!
end

We also wanted more control over the performance impact of our logging. Fortunately, we’ve never been able to point to our logging as a major source of performance drain on our application. That said, we didn’t want to get ourselves into such a predicament. Formatting a string to emit to a log is generally fast, but sometimes we want to compute or fetch additional information to help with later debugging. There had also been efforts to remove log statements from the code that were set to log at levels below where we typically ran. While this was fine and all, we really wanted the freedom to leave some poor-performing code in and be assured that unless we needed to turn it on, it would remain off.

As you can see below, we accomplish that with Ruby blocks passed to YetiLogger methods. Since a block’s evaluation is deferred, we can check the logging level before deciding whether to evaluate it at all (and thus whether to even build the string to log). Unfortunately, Ruby doesn’t have lazily evaluated method parameters, so we couldn’t rely on those; blocks gave us what we were looking for. We can now fill calls to YetiLogger with inefficient blocks that look up all the information we’d want to debug something, knowing that unless we crank up the log level they’ll safely not impact performance.
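To make the deferral concrete, here is a minimal sketch of a block-based logging method that checks the level before evaluating the block. This is not YetiLogger’s actual implementation; `LazyLogger` is a hypothetical name used only for illustration.

```ruby
require 'logger'

# Hypothetical sketch of deferred log-message construction; not
# YetiLogger's real implementation.
class LazyLogger
  def initialize(logger)
    @logger = logger
  end

  # If debug logging is disabled, the block is never called, so any
  # expensive lookups inside it cost nothing.
  def log_debug(message = nil, &block)
    return unless @logger.debug?
    @logger.debug(message || block.call)
  end
end

logger = Logger.new($stdout)
logger.level = Logger::INFO

# At INFO level, this block is never evaluated, so the (hypothetical)
# expensive computation inside it never runs.
LazyLogger.new(logger).log_debug do
  "debug_data=#{{ user_count: 10_000 }.inspect}"
end
```

The whole trick is the early `return unless @logger.debug?`: the level check happens before `block.call`, which is exactly the ordering a lazily evaluated method parameter would have given us.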

The last, and probably biggest reason for YetiLogger is one of consistent formatting. We’ve tried out several logging tools and are currently big fans of SumoLogic. It’s great for consolidating all of our production system logs and giving us a nifty way to search through them. One thing that quickly came up after we began using our logs more was that the formatting of log messages was all over the place. Parsing structured data is much easier than extracting information from plain text. Take these two log messages for example:

user_id=37 email=yeti@yesware.com msg=login

and

"user 37 (yeti@yesware.com) logged in"

Now, extracting the user id and email from either of these is not terribly hard. In fact, it’s arguably not hard at all. The problem isn’t with any one such format; when every log statement has a slightly different way of formatting things, it becomes a problem of scale. At some point, the code to extract all the possible ways a user id might be written is huge. And that’s just one field. The key=value formatting that we prefer, however, is much easier for a machine to read. It makes writing parsers to extract attributes much more cookie cutter, rather than each one looking slightly different. While YetiLogger does support unstructured log messages, by far the preferred way to use it is to take advantage of the key=value style format.
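As a sketch of why this format is so machine-friendly, serializing a hash to key=value pairs and parsing them back are both one-liners. These helpers are illustrative, not part of YetiLogger’s API:

```ruby
# Illustrative helpers (not YetiLogger's API) showing how mechanical
# key=value formatting is in both directions.
def to_kv(hash)
  hash.map { |k, v| "#{k}=#{v}" }.join(' ')
end

def parse_kv(line)
  line.split(' ').map { |pair| pair.split('=', 2) }.to_h
end

to_kv(user_id: 37, email: 'yeti@yesware.com', msg: 'login')
# => "user_id=37 email=yeti@yesware.com msg=login"

parse_kv('user_id=37 email=yeti@yesware.com msg=login')
# => {"user_id"=>"37", "email"=>"yeti@yesware.com", "msg"=>"login"}
```

Note that this naive parser assumes values contain no spaces; a real formatter would need quoting or escaping for values like free-text messages.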

That’s a little about how we got to where we are with YetiLogger. Now let’s take a look at how to use it.

How it works

To start with, install it via RubyGems:

gem install yeti_logger

YetiLogger is a wrapper around a Logger-like class, so before you can use it, you must configure it. The only required configuration is the logger class to use for the actual logging.

require 'yeti_logger'
 
YetiLogger.configure do |config|
  config.logger = Rails.logger
end

The logger can be any Logger-like class that YetiLogger will defer to for actual logging. It also relies on the underlying logger for configuration such as log levels.

Once you have a configured YetiLogger, you can begin mixing it into your classes and modules. Adding it to a class will give you both class-level methods as well as instance-level methods such as log_info, log_warn, etc.

class MyClass
  include YetiLogger
 
  def test_logging
    log_info("hello!")
  end
end
 
MyClass.new.test_logging

The above will output a log line that looks like this:

2016-04-05T10:23:29.135-04:00 pid=90811 [INFO] - MyClass: hello!

The bits at the beginning of the line are all specified via the underlying Logger format configuration. The last bits of the line (MyClass: hello!) all come from YetiLogger. In general, the format for a log message is <Class name>: <log message>. The signature for each of the log_* methods looks like this:

log_info(obj = nil, exception = nil, &block)

Yes, everything is optional, but that’s to give you the most flexibility. Let’s drill into each of the arguments separately.

The first arg is obj. This can be any object, but it boils down to one of three types: a hash, an exception, or everything else. See the next paragraph if obj is an exception. If it is neither an exception nor a hash, then we simply call #to_s on it and log that. If obj is a Hash, though, then we convert it into a key=value key=value string of the hash’s entries. Remember that one of the reasons for YetiLogger was formatting? We found that key=value formatting of log messages helps us write tools that search logs for information. It’s much easier to find all activity associated with a user by searching for user_id=1234 than to remember every format in use so you don’t wind up matching all versions of logging a user id: user_id:1234, user_id(1234), or, worst of all, just 1234.

The second argument is exception. This argument, if present, should be a Ruby Exception. YetiLogger will print the message, class and backtrace of the exception. If the first argument (obj) is a Hash, the exception details will be added to it and equivalently logged as key=value pairs.
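The behavior described above can be sketched like this. `exception_to_kv` is a hypothetical helper written for illustration, not YetiLogger’s internal method:

```ruby
# Hypothetical helper mirroring the behavior described above: fold an
# exception's class, message, and (first line of) backtrace into the
# context hash, then format everything as key=value pairs.
def exception_to_kv(hash, exception)
  hash.merge(
    error_class: exception.class.name,
    error_message: exception.message,
    error_backtrace: (exception.backtrace || []).first
  ).map { |k, v| "#{k}=#{v}" }.join(' ')
end

begin
  raise ArgumentError, 'bad input'
rescue => e
  puts exception_to_kv({ user_id: 37, msg: 'operation failed' }, e)
  # prints something like:
  # user_id=37 msg=operation failed error_class=ArgumentError error_message=bad input error_backtrace=...
end
```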

The last argument, a block, is the preferred way to call YetiLogger for logging messages below the level you specify for logging. For instance, if you’ve configured the Logger to log at the info level, then all debug-level log statements should use the block form of calling. The block form of calling YetiLogger defers evaluation of the block until after the log level is checked. This allows you to leave in log statements that may be very expensive to compute (lookup additional data from a database for instance) and be assured that they won’t slow down your app, until you crank up the log level so they are evaluated. The value returned from the block follows similar rules above for the obj parameter. For instance:

log_info(user_id: user.id, msg: "logging in")

is equivalent to

log_info do
  {
    user_id: user.id,
    msg: "logging in"
  }
end

We routinely run with our logging levels set at info, but we still use the block form frequently as it’s also a convenient way to encapsulate all the logic associated with forming a log message. For instance:

log_info do
  msg = if user.first_login?
    "first login"
  else
    "login"
  end
  {
    user_id: user.id,
    msg: msg
  }
end

Continue reading more about YetiLogger in the README, including test support and the internal message formatters that you may find useful outside of a logging context.

Conclusion

I hope you liked reading about YetiLogger and that you find some use for it in your projects. As always, let us know about any questions or improvements via issues or pull requests against the repository. We’ve been using YetiLogger pretty much untouched for a few years now and have been quite happy with it. It’s been a huge time saver in instilling conventions that we find useful for debugging issues via logs, and we hope you find them useful too.

https://github.com/Yesware/yeti_logger

Using the MongoDB Oplog to trigger asynchronous work

Background

Over the past few years at Yesware, we’ve settled into a frequent pattern for handling asynchronous chunks of work in various apps in our microservices architecture. Typically thus far in our Ruby applications, the pattern has gone something like this:

  1. An incoming piece of data arrives, through an HTTP endpoint, or a database write, or a read from a message queue. We need to do no more than 1 or 2 quick things with that data before responding or exiting so that we can keep up with the onslaught of incoming data. Usually this means writing the piece of data to a database, then enqueueing to a message queue for further asynchronous processing.

  2. Once the data or a pointer to it has been written to a message queue, a worker reads that job, processes the data, then publishes to some other queue, or takes some kind of direct action.

This pattern is not a bad one, and it has served us well in many cases. However, it adds overhead in the form of extra services to stand up (a message queue), boilerplate code to maintain for things like intermediate data structures, and undesirable complexity. For a recent project we decided to try a new approach. This project involves reading from a high-volume exchange on RabbitMQ (our latest message queueing system of choice), writing much (but not all) of that data to what we expect will soon be a very large MongoDB cluster, then updating a smaller Postgres database with aggregated stats about the data we’ve just written.

The filtering of raw data and storing into MongoDB is fast, but aggregating the data into Postgres will take far too long to allow the RabbitMQ queue to keep up. Now, you haven’t truly lived until you’ve let an overflowing queue send the RabbitMQ cluster into flow control, throttling your publishers such that some of your precious data is dropped instead of published, but we’ve noticed an interesting trend. Our customers tend to like it better when things work and data doesn’t get lost, so we do our best to ensure flow control remains a distant threat. Asynchronous aggregation it is. Since we’re writing the data to a MongoDB replica set, why not use Mongo’s inherent replication functionality to trigger the downstream processing?

MongoDB Replication

Replication in a MongoDB cluster is handled by secondary nodes requesting records from the primary node’s oplog, then applying those changes as they come in. Since the primary stores its oplog in a Mongo collection, any other process can read that collection and do whatever it likes with the changes as they occur. There is some prior art for this on the web, but not much in the Ruby world. However, Stripe developed a nifty gem a while back called mongoriver that does just that. It reads from the oplog, maintains its position in the oplog with a timestamp stored back into Mongo, and uses an evented model to issue callbacks when various types of operations occur. Sounds great, right? It kind of is, but we encountered a few bumps during implementation.

Setup

To use mongoriver, you need a MongoDB replica set. Without a replica set, there is no replication (duh!), which means no oplog (doh!). We usually develop against a single-node Mongo in development, but to get this working in a development environment, we had to set up a couple more nodes and convert them into a replica set. This is as simple as creating data directories for your extra nodes, then using the mongo console to initiate the replica set. MongoDB has a great summary of that here.

Once that’s going, you’ll need a couple of classes in your Ruby app to handle the oplog: an outlet that is triggered when new operations occur, and a worker to set up the tailer and stream the oplog. The outlet is the easy part; let’s start with that.

class FilteredThingOutlet < Mongoriver::AbstractOutlet
 
  # This method will be called on every insert in the oplog, and will be given 3
  # params: the DB name, the collection name, and a hash that is the document being
  # inserted.
  def insert(db_name, collection_name, document)
 
    # We only want to publish documents of the right type that have a user_id
    if collection_name == "filtered_things" && document.keys.include?('user_id')
 
      # Publish the full document (in our case this also wraps the document in a Thrift
      # struct) for downstream processing
      RabbitMQPublisher.publish(document)
    end
  end
end

Easy. The worker is a bit more complex, as it needs to first set up a tailer that can read from the oplog, then stream output from that tailer to the outlet. Also, in order to maintain its position in the oplog, we use a PersistentTailer that knows how to save that position to a live mongo connection. In a simple development environment, this can usually be the same connection that the oplog is reading from.

class FilteredThingOplogWorker
  def self.run
    # Get the MongoDB connection from the MongoMapper model (more about MongoMapper in a bit...)
    mongo_connection = FilteredThing.collection.db.client
 
    # This will persist the oplog position to the DB every 60s into a collection called
    # ‘oplog-tailers’ by default
    tailer = Mongoriver::PersistentTailer.
      new([mongo_connection],
          :existing, # Use an existing MongoDB connection (instead of creating a new one)
          'filtered_things' # A name for the position persistence to use. Using something
                            # similar to the data in the collection being tailed makes
                            # sense here.
      )
 
    # Hook up the oplog stream to our handler class, FilteredThingOutlet
    stream = Mongoriver::Stream.new(tailer, FilteredThingOutlet.new)
 
    # Stream 4ever!
    stream.run_forever
  end
end

We then wrap this in a simple rake task that calls FilteredThingOplogWorker.run and watch the streaming happen. In a development environment, this works swimmingly. But in a production environment, there are typically separate users for each database, even if those users have the same name and password. In our case, the databases are filtered_data, where the data lives, admin, where the oplog is, and _mongoriver, which is the default name of the DB to which Mongoriver will persist its position. Unfortunately, this means using separate authentications for each database, but we can at least authenticate multiple times on the same connection. In addition, the default 60 second persistence is perhaps a little conservative for our tastes, but that’s also easy to change by passing an option to the tailer. The worker then becomes a little more complicated.

class FilteredThingOplogWorker
  def self.run
    # This will persist the oplog position to the DB every 10s with the
    # :save_frequency option
    tailer = Mongoriver::PersistentTailer.
      new([mongo_connection], # Now defined by the method below
          :existing,
          'filtered_things',
          {
            save_frequency: 10, # Persist position every 10s (overriding the 60s default)
            db: '_mongoriver' # Store the position in this DB
          })
 
    stream = Mongoriver::Stream.new(tailer, FilteredThingOutlet.new)
 
    stream.run_forever
  end
 
  # Get a Mongo connection that has permissions to tail the oplog, and to store
  # the state in the _mongoriver DB. This means authenticating with 2 additional
  # DBs. The user/pass combos are the same on all 3 DBs (admin, _mongoriver, and
  # filtered_data), so no extra config is necessary.
  def self.mongo_connection
    # Only need the extra authentication in production
    if Rails.env.production?
      # 'hosts', 'user', and 'password' should be pulled in from the environment, or
      # from a Mongo configuration (ie, mongo.yml)
      Mongo::ReplSetConnection.new(hosts).tap do |conn|
        conn.db('admin').authenticate(user, password)
        conn.db('_mongoriver').authenticate(user, password)
      end
    else
      # Everywhere except prod, just reuse the FilteredThing Mongo connection
      FilteredThing.collection.db.connection
    end
  end
end

There are a couple of things worth pointing out. First, we’ve specified the database _mongoriver for storing the position. Technically it can be called whatever you like, and it could even be the same database from which you’re reading the oplog. However, if it is, then you have to deal with the fact that the outlet callbacks will fire when the position is written, since it’s just another insert. I think it’s cleaner to have a separate database for the position, even if it only has 1 collection – oplog-tailers – with only 1 document. Incidentally, the oplog-tailers collection name can also be overridden via the :collection option.

In addition, because there is a set frequency at which the tailer will save its position, the outlet callbacks may fire on duplicate oplog entries in the case where the worker restarts or crashes in the middle of the window. We’ve designed our downstream aggregation processing to gracefully handle duplicate publishes, so that isn’t a problem. In other use cases, it might require extra work to ensure the downstream processing is idempotent, since there will certainly be duplicates at some point.
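For illustration, here is one way a downstream worker could guard against replayed oplog entries by remembering which document ids it has already processed. This is hypothetical code, not our production aggregator, and a real system would use a durable store rather than an in-memory set:

```ruby
require 'set'

# Hypothetical replay guard: because the tailer's position is only saved
# every few seconds, a restart can re-deliver recent oplog entries.
class IdempotentAggregator
  def initialize
    @seen = Set.new # in production, use a durable store instead
  end

  # Returns true if the document was aggregated, false for a replay.
  def process(document)
    id = document['_id'].to_s
    return false unless @seen.add?(id) # Set#add? returns nil on duplicates
    aggregate(document)
    true
  end

  private

  def aggregate(document)
    # update aggregate stats in Postgres here
  end
end

aggregator = IdempotentAggregator.new
doc = { '_id' => '55cd8d4ef4da36b0fe2aad19', 'user_id' => 37 }
aggregator.process(doc) # => true  (first delivery is aggregated)
aggregator.process(doc) # => false (duplicate delivery is ignored)
```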

Results

As you can see, this new pattern requires very little code, which of course means much less maintenance overhead and general complexity. In addition, since it relies on MongoDB’s existing replication framework, which has to be fast in order for replication to function properly, the latency from document write to RabbitMQ publish is nearly zero and not subject to any additional dependencies.

However, there is a substantial, though not insurmountable, downside to this new style. We’ve historically used MongoMapper instead of Mongoid as our MongoDB ORM of choice at Yesware. We’ve written many plugins for it and lean on it pretty heavily across many of our microservices. But version 0.13.1 came out in 2014, and while there does seem to be some recent activity on master, it doesn’t appear to have an active maintainer anymore. In addition, now that Mongoid no longer requires its own driver (Moped) but instead uses version 2 of the default MongoDB Ruby driver, the choice of MongoMapper for our ORM has been questioned. Some of our recent microservices have used Mongoid with success, although in a pretty basic capacity: they aren’t attempting to use more advanced features, like covered queries, that are not supported by Mongoid. Among other things, the connection objects were completely rewritten in version 2 of the Ruby driver, and Mongoriver doesn’t work with them, which means Mongoriver and Mongoid are not compatible (older, Moped-based versions of Mongoid may work, but that does not interest us). In fact, like MongoMapper, Mongoriver looks like it may have been abandoned of late; its most recent commit is also from 2014.

This means that we’re using a seemingly unmaintained gem (mongoriver) which relies on another seemingly unmaintained gem (mongo_mapper), and neither can be updated to use the most recent Ruby driver. This is fine for now, but will eventually be a problem, since in the future we’ll probably need to upgrade to a Mongo too new to support the older driver. We’ve considered starting our own forks of MongoMapper and Mongoriver to get around this problem, and we may well do that, but the potential burden of that extra work is definitely a downside with this strategy. It’s close to a turnkey solution for now, but may not be for long. For anyone considering adopting Mongoriver, this is a meaningful consideration.

Despite the potential maintenance downside, the addition of Mongoriver to our workflow seems like a smashing success so far, and I expect we’ll be looking to it to make data passing easier in other places where we can piggyback on the existing replication infrastructure that MongoDB already provides.

Extra Credit

At Yesware, we love to be woken up at 3 am because something went terribly wrong. Wait, that’s not right. We love it when something goes wrong at 3 am and pages us. No, that doesn’t sound right either. We hate it when things go wrong, but on the rare occasion when it does, we want to know about it ASAP. (Preferably not at 3am. Yeah, that’s it.) This means we love monitoring, and nearly all of our features involve some degree of monitoring so that we know their health at all times. This feature was no different. Since we’re storing the tailer’s position in the oplog every 10 seconds, why not record a metric there so we can alert on any potential lag? Let’s take a look at the record that gets written to oplog-tailers with the tailer’s position.

{
    "_id": {
        "$oid": "55cd8d4ef4da36b0fe2aad19"
    },
    "service": "filtered_things",
    "state": {
        "time": {
            "$date": "2016-03-10T02:36:29.000Z"
        },
        "position": {
            "$ts": 1457577389,
            "$inc": 1
        }
    },
    "v": 1
}

It’s a pretty simple document, basically containing an _id like every MongoDB doc, the service name we told the tailer to use, a timestamp, and a position. We created a rake task that we can call every 10 minutes to fetch this document, compute how far behind realtime the timestamp position is and record that to our statsd server.

task :filtered_things_oplog_worker_delay => :environment do
  include YetiLogger # https://github.com/Yesware/yeti_logger
  mongo = FilteredThingOplogWorker.mongo_connection

  # Fetch the current record
  record = mongo.db("_mongoriver").collection("oplog-tailers").
             find(service: "filtered_things").first

  # Determine the last time this job ran, and where it is in the oplog
  last_run_at = record["state"]["time"]
  oplog_at = Time.at(record["state"]["position"].seconds)
  seconds_behind = last_run_at - oplog_at

  log_info(worker: FilteredThingOplogWorker, msg: "oplog tailing state", seconds: seconds_behind)
  Metrics.gauge("work.FilteredThingOplogWorker.oplog_behind", seconds_behind)
end

Then we set an alert for when the lag rises above a threshold that concerns us, and our confidence is increased. And hopefully we’re all sleeping soundly at 3am.

Some things you wanted to know about fonts (but were too afraid to ask)

As a web developer, your most common font problem is probably “Why doesn’t this character look like what I expect it to look like?”

Your problem is either

This character is legible but isn’t in the nice font I/my company shelled out big bucks for

or

This character isn’t even legible

In the latter case, your character may look like ’this’, or it may just be a bunch of �����.

First I’m going to define some concepts with which we can troubleshoot most font problems. Then I’ll apply them to the three typical font problems I mentioned above.

Encoding: A set of mappings from sequences of bits to characters. For example, ASCII is a set of mappings from sequences of seven bits to characters, in which 01101000 maps to h.

Code point: The key in such a mapping, e.g. 01101000.
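To make the encoding/code point distinction concrete, here’s a quick Ruby sketch (Ruby just because it’s handy; any language would do):

```ruby
# The character "h" is stored as the byte 104 (0x68).
"h".bytes.first  # => 104

# In binary, that's the seven-bit ASCII code point for "h",
# usually written padded to eight bits as 01101000:
104.to_s(2)      # => "1101000"
```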

Glyph: An image of a character. For example, these are all glyphs for the character a:

(source: https://en.wikipedia.org/wiki/Glyph#/media/File:A-small_glyphs.svg)

Font: A set of glyphs representing a range of characters. These are the glyphs that make up the font Comic Sans Regular. The characters supported by a font typically belong to a group, such as ASCII characters or math symbols.

Now that we have a lexicon with which to talk about these things1, we can make some headway with troubleshooting each of the three font issues I mentioned:

  1. Correct but ugly characters. This must mean whatever is rendering the characters (since you’re probably a web developer, it’s probably your browser) isn’t using the intended font. Either it doesn’t know that it’s supposed to use that font, or it knows but doesn’t have access to that font, so it can’t find the glyph images it needs to render.
  2. Wrong character (’). If, say, you’re expecting ' but getting ’, this must mean the bit stream underlying your string is being parsed incorrectly, such that the bits representing ' are somehow being mapped to ’ instead. In this case, since you’re expecting one character but getting three, your bits don’t appear to be broken up properly to begin with. If you get one character but it’s the wrong one, that also means the bits -> character mapping went awry. Either way, this sounds like an encoding problem (encoding == mappings, remember?). Maybe your string was encoded in an 8-bit encoding such as ISO-8859-9, but is being interpreted with a 7-bit encoding such as ASCII.
  3. No characters (���, ???, what have you). Either there’s no mapping from those bits to a character in the encoding your browser is using (for example, ISO-8859-1 doesn’t know what to do with 0x1F), or there’s a mapping but there’s no glyph for that character in the font your browser is using (say, because your stylesheet specifies Comic Sans and the character is Japanese).
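Both failure modes are easy to reproduce. Here’s an illustrative Ruby sketch (the specific encodings are my choice for demonstration, not necessarily the ones behind your particular bug): the classic ’ garbage appears when UTF-8 bytes are read as Windows-1252, and the � replacement character appears when bytes don’t map to any character at all:

```ruby
# U+2019 (the curly apostrophe) encodes to three UTF-8 bytes: E2 80 99.
# Read those same bytes as Windows-1252 and each byte becomes its own
# character: one expected character in, three wrong characters out.
mojibake = "\u2019".b.force_encoding("Windows-1252").encode("UTF-8")
mojibake  # => "’"

# A lone 0x99 byte is not valid UTF-8 on its own, so there is no
# character to map it to; scrubbing replaces it with U+FFFD (�):
"\x99".b.force_encoding("UTF-8").scrub  # => "�"
```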

In later posts, we’ll walk through further debugging steps for each type of issue.


1 If you find my grossly simplified glossary unsatisfactory, here’s a more thorough and entertaining overview written by someone smarter than me.

Product Development Evolution at Yesware

When I started at Yesware just over two years ago I was its first ever Scrum Master. I had worked as a Scrum Master for years at a few different companies and I was really excited to come to Yesware. I loved the people, the location in downtown Boston, the office space (and that has since gotten even better!) and the culture.

One thing that wasn’t so great at the time was the product development process. I was hired to help with that.

Standups

The first thing I noticed was that standups didn’t seem to be delivering much value. The team went around in a circle and talked about what they were up to, but it wasn’t always clear which work items they were referring to, and how many items they were working on at once. To manage project work, the teams were using a bug tracking tool. That’s great for bug tracking, but not great for Scrum teams. So I switched tools to something designed for Scrum teams (JIRA) and I began using the work board feature as the visual in standups. When team members spoke about what they were doing it was very obvious if it was actual sprint work or if it was outside work. I think standups improved.

I take the rules of standups very seriously. We only get fifteen minutes. People should not go into deep dives; take it into a separate conversation. People should let their fellow team members know if they accomplished what they said they would in the prior standup.

Since I work with all the teams at Yesware, I am in a ton of meetings, but that doesn’t mean that everyone should be. Standups are so valuable because they reduce the need for other meetings. Everyone knows there’s an opportunity to sync up with the team at the same great time, same great location every day. When issues arise, the involved individuals can hold a follow up meeting, not dragging everyone into the discussion if that’s not prudent.

In case it’s not obvious, I love standups.

Planning

So, standups were improving. I wish I could say the same for planning meetings!

At the time, we were working in two week sprints. Our meetings were scheduled for four hour blocks and more often than not the teams used up the entire four hours. It is really hard to sit through a four hour meeting, even if you love meetings, which I do. Teams would discuss sprint work in great detail, and then several days into the sprint everyone would forget what we had talked about in planning. Definitely not a good use of time.

Also, since estimating in story points and tracking velocity had worked well at my past companies, I wanted to do that at Yesware. It just didn’t work here though and it was another time suck. So we stopped spending time on any detailed estimating and now do only high level, t-shirt size estimates.

Where we are now

Fast forward to our new and improved process! We don’t strictly follow Scrum anymore, but instead do kind of a hybrid of Scrum/Kanban/Yesware special sauce. We still do standups, because they are awesome! Every morning (but not too early) we meet up, talk about what needs to be done, and get on with our day.

We still do planning meetings but they are nothing like before. We take more of a just-in-time approach to planning. Teams have two planning meetings a week on their calendars but they only happen if there is work to discuss. We talk about what we are doing now, and once we have enough detail for people to get to work, we break and go back to our desks. Oh yeah, and each meeting is only scheduled for an hour. There’s no worrying about anyone falling asleep due to meeting exhaustion.

Other improvements

In addition to the day-to-day process changes, we’ve made improvements at a higher level as well. I’d like to highlight some of those.

Besides standups, my favorite meeting from the Agile world is the retrospective. I love to talk about things we should change to work more efficiently. We have retrospectives about every two weeks and it’s a closed door meeting, meaning only the team members may attend. People are free to say whatever they feel, and what happens in retro stays in retro. Many process changes have come as a result. For example, discussing issues with using our work board has led to changing work-in-progress constraints and simplifying the board’s workflow. Even little things, like putting a status dashboard on a big TV and tracking standup tardiness to encourage faster meetings, had their roots in retros.

The simplification of daily process has made it easier to see the big picture and what’s really going on. Back when we did Scrum I kept an eye on how many work items a team member was working on, but it wasn’t a focal point for the entire team. Now it is! We closely monitor our work in progress to make sure we are not spreading ourselves too thin and working on a bunch of stuff when we could be delivering fewer things sooner.

Finally, reducing the duration and frequency of meetings has allowed us to improve our scheduling. Back in the old days we would have essentially all day meetings at the start of each Sprint. Now the majority of meetings are in the morning, generally right after standup. People attend their meetings and then have a big chunk of uninterrupted time to actually get stuff done. The cost of context switching is well known and I try hard to minimize that at Yesware.

Wrapping up

Things work differently than they did when I started here, but I think nearly everyone feels it’s been a change for the better. We certainly haven’t figured everything out at Yesware. But we have taken plenty of steps to improve our process, and we continue to tweak it.

VSTO Lessons Learned

On the MSDN website, the Office and SharePoint Development in Visual Studio page discusses two primary options for developing an Office add-in: 1) use the latest and greatest Office add-in technologies targeting Office 2013 and SharePoint 2013, or 2) use Visual Studio Tools for Office (VSTO), presumably to target Office 2010 and older versions.

If you intend to distribute your Office add-in to the general public, don’t use VSTO.

VSTO offers many features to develop an add-in, including graphical editors to define Office UI customizations. It has extremely simple tools to manage deployment, and it magically loads code whenever the user starts your targeted Office application.

The problem with deploying it to the general public is that the VSTO Runtime (VSTOR) is a prerequisite that the user must have installed, and dealing with the VSTOR in the wild sucks.  In our experience, a number of users simply could not install the VSTOR, and there were issues even with those who could.

VSTOR upgrades can leave behind problematic artifacts, and the resulting failures often produce opaque error messages. You can find a number of complaints on the MSDN Forum about one particular artifact that VSTOR upgrades leave behind. Below is one example, with no apparent official resolution, just an incomprehensible error message:

The value of the property ‘type’ cannot be parsed. The error is: Could not load file or assembly ‘Microsoft.Office.BusinessApplications.Fba, Version=14.0.0.0, Culture=neutral, PublicKeyToken=71e9bce111e9429c’ or one of its dependencies. The system cannot find the file specified. (C:\Program Files (x86)\Common Files\Microsoft Shared\VSTO\10.0\VSTOInstaller.exe.Config line 10)

MSDN Forum

Additionally, if your VSTO add-in takes too long to load, Office applications can disable it. This is likely to happen, because the timer starts before the CPU even reaches the add-in’s instructions. That’s right: your add-in can be penalized for “performance reasons” before it is even loaded, never mind starting to run. And if any of your potential users are on a slow computer, there is little that optimizing your code can do. Imagine our frustration trying to trace performance issues in our application when users reported that it was disabled for “being too slow,” when many of the reasons it may have been “too slow” didn’t involve our code at all.

The deployment mechanisms that VSTO borrows from ClickOnce can also fail, again with inscrutable error messages such as “Value does not fall within the expected range.” Here is another product’s support page that discusses a workaround that isn’t particularly user friendly.

Apparently, the VSTOR is prone to becoming corrupt. Our customer service representatives communicated that while troubleshooting with the user, our application started to work correctly when the user did nothing more than run the “Repair” tool for the VSTOR.

Common themes for these issues involve the development team spending too many hours on the following actions:

  • Investigating incidents with paltry and inscrutable logs
  • Identifying potential causes and fixes
  • Experimenting with fixes under various foreseeable circumstances
  • Experimenting with fixes in the wild
  • Refining those fixes as users give us feedback

Some of these issues seemed intractable, even as users went to great lengths to make themselves available for troubleshooting. Others went silent after prolonged back-and-forth with our support team as we worked through our support script. I imagine a large portion of folks simply walked away after encountering the first issue or two while trying to set up our product; given the number of possible issues, many folks could fall into this category.

With all this doom and gloom, what alternative is there? There is some talk of an approach that involves implementing the IDTExtensibility2 interface directly. There is also the suggestion to use a tool like Add-in Express for Microsoft Office and .net, which does the heavy lifting of implementing IDTExtensibility2 and provides features that have allowed us to replace VSTO and overcome the aforementioned problems by avoiding them (the VSTOR) altogether.

Moving to Add-in Express rather than sticking with VSTO has undeniably been the right move for Yesware for Outlook product development. No difficult-to-install, buggy prerequisite exists, so prospective users’ ability to install our product has greatly improved. The ADX loader has also resolved the frequent complaints about Yesware for Outlook being disabled due to slow loading.

There are still some issues for us to improve. We kept the ClickOnce deployment mechanism since we focused on removing the VSTO dependency first, but ClickOnce can sometimes fail in the same user-unfriendly ways mentioned before. Add-in Express offers a deployment mechanism which leverages the Windows Installer, and that is likely to make our installer more reliable. We are always working on improving our product, so expect an MSI option for Yesware for Outlook soon!

Hello World

Welcome to the Yesware Engineering Blog. Behind the awesome software we build to help salespeople is a great team of engineers and product folks. In this blog you’ll hear from several of them on topics ranging from product design to debugging issues in .NET.

Please comment on these blog posts and ask questions and we’ll do our best to answer you. We’re a small team and really proud of the culture and software we’ve built and are eager to share with you some of our experiences. We hope you benefit from our posts as much as we’ve benefited from solutions to our own problems found on your blog posts.

In the meantime, please check out the Yesware github page for projects we’ve open sourced (something we’re planning on doing more of), as well as some of our data blogs where we’ve mined our data sources for insights to help salespeople.

Thanks and welcome to the Yesware Engineering Blog.