Tuesday, July 17, 2012

The DRb JRuby Bridge

This is a technique that has proved useful when a Ruby application needs to call some Java code, but writing the application entirely in JRuby is not appropriate.

Examples:
* A required Ruby module is not compatible with JRuby (e.g. Qt)
* The Java code is only needed by optional parts of the system such as plugins

The solution is to use Distributed Ruby (DRb) to bridge between a Ruby application and a JRuby process. A JRuby application is written that contains a DRb service which acts as a facade for the Java code. A Ruby class provides an API for managing the JRuby process. The Ruby application uses the API to start the JRuby process, connect to the facade object via DRb, and stop the JRuby process when it is no longer needed.
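The facade idea can be sketched in plain Ruby before any Java is involved. In this minimal sketch, UpcaseFacade and with_bridge are hypothetical names that do not appear in the example that follows, and both ends run in one process purely so the sketch can be run as-is; in the actual bridge the server side lives in the JRuby process.

```ruby
require 'drb'

# Hypothetical facade; in the real bridge this object would wrap Java calls.
class UpcaseFacade
  def shout(text)
    text.upcase
  end
end

# Start a DRb server for +front+, yield a client-side proxy, then shut down.
# Port 0 lets the operating system pick a free port; server.uri reports it.
def with_bridge(front)
  server = DRb::DRbServer.new('druby://127.0.0.1:0', front)
  yield DRb::DRbObject.new_with_uri(server.uri)
ensure
  server.stop_service if server
end

with_bridge(UpcaseFacade.new) do |proxy|
  # The proxy forwards shout to the facade object held by the server.
  puts proxy.shout('hello')
end
```

The client only ever sees the proxy, so the facade's methods are the whole contract between the two processes.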


The following example is a simple implementation that provides access to the Apache Tika library. It is, of course, for illustration only; Tika comes with a much more useful command-line utility, and a (J)ruby-tika project already exists.

The directory structure for this example is as follows:

  bin/test_app.rb
  lib/tika-app-1.1.jar
  lib/tika_service.rb
  lib/tika_service_jruby


The JRuby application, tika_service_jruby, requires the java module and the Tika jar. It also requires the tika_service module, which defines shared constants such as the default port number.


#!/usr/bin/env jruby


raise ScriptError.new("Tika requires JRuby") unless RUBY_PLATFORM =~ /java/


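require 'java'
# Let require find the jar and tika_service.rb, which sit next to this script
$LOAD_PATH.unshift File.expand_path(File.dirname(__FILE__))
require 'tika-app-1.1.jar'
require 'tika_service'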


# =============================================================================
module Tika


  # ------------------------------------------------------------------------
  # Namespaces for Tika plugins


  module ContentHandler
    Body = Java::org.apache.tika.sax.BodyContentHandler
    Boilerpipe = Java::org.apache.tika.parser.html.BoilerpipeContentHandler
    Writeout = Java::org.apache.tika.sax.WriteOutContentHandler
  end


  module Parser
    Auto = Java::org.apache.tika.parser.AutoDetectParser
  end


  module Detector
    Default = Java::org.apache.tika.detect.DefaultDetector
    Language = Java::org.apache.tika.language.LanguageIdentifier
  end


  Metadata = Java::org.apache.tika.metadata.Metadata


  class Service
    # ----------------------------------------------------------------------
    # JRuby Bridge


    # Number of clients connected to TikaServer
    attr_reader :usage_count


    def initialize
      @usage_count = 0
      Tika::Detector::Language.initProfiles
    end


    def inc_usage; @usage_count += 1; end


    def dec_usage; @usage_count -= 1; end


    def stop_if_unused; DRb.stop_service if (usage_count <= 0); end


    def self.drb_start(port)
      port ||= DEFAULT_PORT


      DRb.start_service "druby://localhost:#{port.to_i}", self.new
      puts "tika daemon started (#{Process.pid}). Connect to #{DRb.uri}"
     
      trap('HUP') { DRb.stop_service; Tika::Service.drb_start(port) }
      trap('INT') { puts 'Stopping tika daemon'; DRb.stop_service }


      DRb.thread.join
    end


    # ----------------------------------------------------------------------
    # Tika Facade


    def parse(str)
      # to_java_bytes keeps the raw bytes intact (no charset conversion)
      input = java.io.ByteArrayInputStream.new(str.to_java_bytes)
      content = Tika::ContentHandler::Body.new(-1)
      metadata = Tika::Metadata.new


      Tika::Parser::Auto.new.parse(input, content, metadata)
      # Detect the language from the extracted text, not from the stream object
      lang = Tika::Detector::Language.new(content.to_string)


      { :content => content.to_string, 
        :language => lang.getLanguage(),
        :metadata => metadata_to_hash(metadata) }
    end


    def metadata_to_hash(mdata)
      h = {}
      Metadata.constants.each do |name| 
        begin
          val = mdata.get(Metadata.const_get name)
          h[name.downcase.to_sym] = val if val
        rescue NameError
          # nop
        end
      end
      h
    end


  end
end


# ----------------------------------------------------------------------
# main()
Tika::Service.drb_start ARGV.first if __FILE__ == $0


The details of the Tika facade are not of interest here. What matters for the technique is the if __FILE__ == $0 line, the drb_start class method, and the inc_usage, dec_usage, and stop_if_unused instance methods. These are used by the Ruby tika_service module to manage the Tika::Service instance.

When run, this application creates a Tika::Service instance and starts a DRb server on the requested port with that instance as its front object. A DRb client that connects receives a DRb proxy object for the instance.
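DRb serves incoming calls on separate threads, so with several clients the bare @usage_count += 1 above could in principle lose an update. A minimal sketch of a mutex-guarded variant follows. UsageCounter and its block argument are hypothetical and not part of the Tika example, and unlike the original this version never lets the count drop below zero.

```ruby
# Hypothetical thread-safe usage counter; the block is run by stop_if_unused
# when no clients remain (in the bridge it would call DRb.stop_service).
class UsageCounter
  def initialize(&on_unused)
    @count = 0
    @lock = Mutex.new
    @on_unused = on_unused
  end

  # Register one more client; returns the new count.
  def inc_usage
    @lock.synchronize { @count += 1 }
  end

  # Release one client; the count is clamped at zero.
  def dec_usage
    @lock.synchronize do
      @count -= 1 if @count > 0
      @count
    end
  end

  def usage_count
    @lock.synchronize { @count }
  end

  # Run the shutdown block only if nobody is connected.
  # Returns true when the service was considered unused.
  def stop_if_unused
    unused = @lock.synchronize { @count <= 0 }
    @on_unused.call if unused && @on_unused
    unused
  end
end
```

A Service class could delegate inc_usage, dec_usage, and stop_if_unused to such an object and pass { DRb.stop_service } as the block.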


The Ruby module, tika_service.rb, uses fork-exec to launch a JRuby process running the tika_service_jruby application. Note that the port number for the DRb service to listen on is passed as an argument to tika_service_jruby.

#!/usr/bin/env ruby

require 'drb'


module Tika


  class Service
    DAEMON = File.join(File.dirname(__FILE__), 'tika_service_jruby')
    DEFAULT_PORT = 44344
    DEFAULT_URI = "druby://localhost:#{DEFAULT_PORT}"
    TIMEOUT = 300    # in 100-ms increments


    # Return command to launch JRuby interpreter
    def self.get_jruby
      # 1. detect system JRuby
      jruby = `which jruby`
      return jruby.chomp if (! jruby.empty?)


      # 2. detect RVM-managed JRuby
      return nil if (`which rvm`).empty?
      jruby = `rvm list`.split("\n").select { |rb| rb.include? 'jruby' }.first
      return nil if (! jruby)


      "rvm #{jruby.strip.split(' ').first} do ruby "
    end


    # Replace current process with JRuby running Tika Service
    def self.exec(port)
      jruby = get_jruby
      Kernel.exec "#{jruby} #{DAEMON} #{port || ''}" if jruby


      $stderr.puts "No JRUBY found!"
      return 1
    end


    def self.start
      return @pid if @pid
      @pid = Process.fork do
        exit(::Tika::Service::exec DEFAULT_PORT)
      end
      Process.detach(@pid)


      connected = false
      TIMEOUT.times do
        begin
          DRb::DRbObject.new_with_uri(DEFAULT_URI).to_s
          connected = true
          break
        rescue DRb::DRbConnError
          sleep 0.1
        end
      end
      raise "Could not connect to #{DEFAULT_URI}" if ! connected
    end


    def self.stop
      service_send(:stop_if_unused)
    end


    # this will return a new Tika DRuby connection
    def self.service_send(method, *args)
      begin
        obj = DRb::DRbObject.new_with_uri(DEFAULT_URI)
        obj.send(method, *args)
        obj
      rescue DRb::DRbConnError => e
        $stderr.puts "Could not connect to #{DEFAULT_URI}"
        raise e
      end
    end


    def self.connect
      service_send(:inc_usage)
    end


    def self.disconnect
      service_send(:dec_usage)
    end


  end
end


The API provided by this module is straightforward: an application uses the start/stop class methods to launch or terminate the JRuby process as needed, and the connect/disconnect methods to obtain (and free) a DRb proxy object for the "remote" Tika::Service instance.

The only complications are in the detection of the JRuby interpreter (including support for RVM-managed interpreters) and the timeout while waiting for JRuby to initialize (which can take many seconds).
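The wait for JRuby to come up can also be written against a wall-clock deadline instead of a fixed number of attempts, since a failed connection attempt may itself take a while. The following is a minimal sketch; wait_for_drb and its parameters are hypothetical names, and it assumes Ruby 2.1 or later for Process.clock_gettime.

```ruby
require 'drb'

# Poll +uri+ until a DRb server answers or +timeout+ seconds have elapsed.
# Returns true once the server responds, false if the deadline passes first.
def wait_for_drb(uri, timeout = 30.0, interval = 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      # DRbObject forwards to_s to the remote object, so it works as a ping.
      DRb::DRbObject.new_with_uri(uri).to_s
      return true
    rescue DRb::DRbConnError
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep interval
    end
  end
end
```

A start method could then raise unless wait_for_drb(DEFAULT_URI) returns true.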



The test_app.rb example application is a simple proof-of-concept. It takes any number of filenames as arguments, and uses Tika to analyze the contents of each file. The results are printed to STDOUT via inspect.


#!/usr/bin/env ruby
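# Make lib/ visible to require when running from the project tree
$LOAD_PATH.unshift File.expand_path('../lib', File.dirname(__FILE__))
require 'tika_service'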


Tika::Service.start
begin
  tika = Tika::Service.connect
  ARGV.each { |x| File.open(x, 'rb') {|f| puts tika.parse(f.read).inspect} } if tika
ensure
  Tika::Service.disconnect
  Tika::Service.stop
end


The real meat of this technique lies in the tika_service module. This contains a Service class that will manage a JRuby application in a manner that conforms to an abstract Service API (start/stop, connect/disconnect) and which can be generalized to support any number of specific JRuby-based services.
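One way the generalization might look is a base class that holds the connection logic, with each service declaring only its daemon script and port. This is a sketch under assumptions, not the generalized version mentioned in the update: BridgeService, the bridge declaration, TikaBridge, and OtherBridge are all hypothetical names, and process start-up is reduced to building the command line.

```ruby
require 'drb'

# Hypothetical base class: subclasses declare a daemon script and a port.
class BridgeService
  class << self
    attr_reader :daemon, :port

    # Class-level declaration used by subclasses.
    def bridge(daemon, port)
      @daemon = daemon
      @port = port
    end

    def uri
      "druby://localhost:#{port}"
    end

    # Command line that would launch the JRuby side of the bridge.
    def command(jruby = 'jruby')
      "#{jruby} #{daemon} #{port}"
    end

    # Call +method+ on the remote facade and return the proxy.
    def service_send(method, *args)
      obj = DRb::DRbObject.new_with_uri(uri)
      obj.send(method, *args)
      obj
    end

    def connect
      service_send(:inc_usage)
    end

    def disconnect
      service_send(:dec_usage)
    end
  end
end

# Values taken from the Tika example above.
class TikaBridge < BridgeService
  bridge 'lib/tika_service_jruby', 44344
end

# A second, purely illustrative service on another port.
class OtherBridge < BridgeService
  bridge 'lib/other_service_jruby', 44345
end
```

Because the settings live in class-level instance variables, each subclass keeps its own daemon and port without affecting the others.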


Update: A generalized (but entirely untested) version is up on GitHub.

2 comments:

  1. Thanks!

    Excellent idea. We needed to access Ruby methods in a JRuby process, from a non-JRuby process. This code served as an excellent base for what we ended up hacking together last night.

    Github for the jruby_bridge
    Rubygems for the jruby_bridge

    Replies
    1. That is a very nice implementation. If I need to write a third Ruby-JRuby application (grog forbid), I'll give it a go.