How we built it: Making Hadoop data exploration easier with Ad-hoc SQL Queries

Aug 28 2014, 10:05 am

Poorna Chandra is a software engineer at Continuuity where he is responsible for building software fueling the next generation of data applications. Prior to Continuuity, he developed big data infrastructure at Greenplum and Yahoo!
Julien Guery is a software engineering intern at Continuuity and a Masters student at French engineering school Télécom Bretagne.

We are excited to introduce a new feature in the latest 2.3 release of Continuuity Reactor: ad-hoc querying of Datasets. Datasets are high-level abstractions over common data patterns, and Reactor Datasets provide a consistent view of your data whether it is processed in batch or in real time. In addition to scans and RPC access, there is also a need to explore Datasets through an easy-to-use SQL query interface. This lets you run ad-hoc interactive queries on Datasets, generate reports on them, and integrate Reactor with the BI tool of your choice. In this post, we will talk about how this works and how we expose an easy-to-use interface for querying data.

In keeping with our mission of providing simple access to powerful technology, we set out to enable ad-hoc querying of Datasets without imposing extra overhead on the user. We accomplished this by integrating the underlying technologies that enable declarative querying directly into Reactor. An important design goal was to introduce this feature without requiring our users to manage additional services or frameworks outside of Reactor. Below we share how we built this system.

Hive as the SQL engine

SQL is a widely adopted standard for ad-hoc querying, and it’s a great way to make it easy for users to interact with Datasets. While there are a number of existing SQL engines in the Hadoop ecosystem, we decided to go with Apache Hive because one of its core components, Hive Metastore (which stores metadata about Hive tables), is also used by other existing SQL engines like Presto and Impala. This means that later integrations of those engines will be easier.

Datasets are non-native, external tables from Hive’s point of view. Hive has a well-defined interface for interacting with non-native tables via Storage Handlers. To integrate Hive with Datasets, we implemented a custom Dataset Storage Handler to manage the interactions; a minimal sketch follows the list below. The Dataset Storage Handler is composed of:

  • a DatasetInputFormat that knows how to retrieve Dataset Records from Datasets, and
  • a DatasetSerDe that converts the records into objects that Hive can work with.
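
For illustration, here is a minimal sketch, not the actual Reactor source, of what such a storage handler can look like when built on Hive's DefaultStorageHandler base class. The DatasetInputFormat and DatasetSerDe classes are the two components listed above; their implementations are omitted here:

    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.mapred.InputFormat;

    // Hypothetical skeleton: points Hive at a custom InputFormat and SerDe
    // so that it can read a non-native table.
    public class DatasetStorageHandler extends DefaultStorageHandler {

      @Override
      public Class<? extends InputFormat> getInputFormatClass() {
        // Tells Hive how to split and read records from Datasets.
        return DatasetInputFormat.class;
      }

      @Override
      public Class<? extends SerDe> getSerDeClass() {
        // Tells Hive how to turn those records into rows and columns.
        return DatasetSerDe.class;
      }
    }

A table backed by such a handler is registered in Hive with a STORED BY '<storage handler class>' clause, which is how Hive binds non-native tables to their storage handlers.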

The data flow between Hive and Datasets is depicted below:

Hive as a Service in Reactor

Typically, a Hive installation consists of a Hive Metastore server that stores the metadata and a Hive Server that runs user queries. In our case, we can connect to an existing Hive Metastore for metadata. However, connecting to an existing Hive Server has the following limitations:

  • Hive Server does not have interfaces that allow access to the complete query life cycle. This limits transaction management while dealing with Datasets.
  • Hive Server runs user code inside the server process, which makes the service vulnerable to misbehaving code.
  • Hive Server’s HTTP interface does not support security and multi-tenancy.

To overcome these limitations, we created a new HTTP-based Hive Server that addresses the above issues by wrapping Hive’s CLIService and integrating it with Reactor Datasets and Tephra, Reactor’s transaction engine. We chose to run this Hive Server in YARN containers for proper user-code and resource isolation, as well as scalability.
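
As a rough illustration of why wrapping CLIService helps, here is a hedged sketch; the class below is ours, not Reactor or Hive code, and the exact CLIService method signatures vary between Hive versions. The point is that the HTTP handlers can drive each step of the query life cycle themselves:

    import org.apache.hive.service.cli.CLIService;
    import org.apache.hive.service.cli.OperationHandle;
    import org.apache.hive.service.cli.RowSet;
    import org.apache.hive.service.cli.SessionHandle;

    import java.util.Collections;
    import java.util.Map;

    // Illustrative only: drives a statement through Hive's CLIService, which is
    // roughly what each HTTP handler in such a service does per request.
    public class QueryRunner {
      private final CLIService cliService;

      public QueryRunner(CLIService cliService) {
        // Assumes a CLIService that has already been initialized and started.
        this.cliService = cliService;
      }

      public RowSet run(String user, String statement) throws Exception {
        Map<String, String> conf = Collections.emptyMap();
        // Open a session, execute the statement, and fetch the results.
        // Having each step in hand is what allows transaction handling to be
        // hooked in around the query, which the stock Hive Server does not expose.
        SessionHandle session = cliService.openSession(user, "", conf);
        try {
          OperationHandle op = cliService.executeStatement(session, statement, conf);
          return cliService.fetchResults(op);
        } finally {
          cliService.closeSession(session);
        }
      }
    }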

Hive-Reactor integration challenges

Since Hive jars bundle disparate libraries such as Protocol Buffers and Guava, and Reactor uses different versions of these libraries, we ran into a recurring classloading issue. We had to make sure that the classpaths of the different components that run Hive in Reactor list their jars in a particular order. We fixed the classpath order in the following places (a configuration sketch follows the list):

  • In Reactor Master: since we only launch Hive service from Reactor Master, it was sufficient to isolate Hive jars in a separate classloader to avoid any library version conflicts.
  • In the YARN container running Hive service: we set the classpath of the container in the right order during its initialization since we have control over it.
  • In the MapReduce jobs launched by Hive on the YARN cluster: we set the Hadoop configuration setting mapreduce.job.user.classpath.first to true so that the Reactor jars listed in the hive.aux.jars.path configuration come before the Hive jars on the classpath of those jobs.
  • In the local MapReduce jobs launched in the same YARN container as the Hive service: we modified the HADOOP_CLASSPATH environment variable and set HADOOP_USER_CLASSPATH_FIRST to true so that the HADOOP_CLASSPATH content comes before the Hive jars on the classpath of those local jobs.
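
For reference, the MapReduce-related settings from the list above boil down to one configuration flag and two environment variables. Here is a hedged sketch; the class and method names are ours, and the Reactor-specific launch code around them is omitted:

    import org.apache.hadoop.conf.Configuration;

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: the two classpath-ordering knobs described above.
    public class ClasspathSettings {

      // For MapReduce jobs that Hive launches on the cluster: make the user
      // (Reactor) jars win over the jars shipped with Hadoop and Hive.
      static void preferUserJars(Configuration hConf) {
        hConf.setBoolean("mapreduce.job.user.classpath.first", true);
      }

      // For local MapReduce jobs inside the Hive service container: put the
      // Reactor jars on HADOOP_CLASSPATH and make that classpath come first.
      static Map<String, String> containerEnvironment(String reactorJars) {
        Map<String, String> env = new HashMap<String, String>();
        env.put("HADOOP_CLASSPATH", reactorJars);
        env.put("HADOOP_USER_CLASSPATH_FIRST", "true");
        return env;
      }
    }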

Future work

We are working hard to enable a more fun and productive user experience for data exploration in Hadoop. One of the features that we plan on introducing in the future is a JDBC driver to connect third-party BI tools to Reactor Datasets. This will make our platform even more accessible to all users who want to work with data in Hadoop.

To try out our latest features, including ad-hoc SQL queries, download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

And if this work sounds exciting to you, check out our careers page and send us your resume!

Running Presto over Apache Twill

Apr 3 2014, 11:35 am

Alvin Wang is a software engineer at Continuuity where he is building software fueling the next generation of Big Data applications. Prior to Continuuity, Alvin developed real-time processing systems at Electronic Arts and completed engineering internships at Facebook, AppliedMicro, and Cisco Systems.

We open-sourced Apache Twill with the goal of enabling developers to easily harness the power of YARN using a simple programming framework and reusable components for building distributed applications. Twill hides the complexity of YARN behind a programming model that is similar to running threads: instead of having to write the same boilerplate code for every application, you get a simple and intuitive API for building an application over YARN.
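
To make the thread analogy concrete, here is a minimal, hedged sketch of a Twill runnable (the class name is ours); structurally it is little more than a Runnable that Twill launches inside a YARN container:

    import org.apache.twill.api.AbstractTwillRunnable;

    // A runnable that Twill launches inside a YARN container, much like
    // handing a Runnable to a thread.
    public class EchoRunnable extends AbstractTwillRunnable {

      @Override
      public void run() {
        // Application logic goes here; Twill takes care of container
        // allocation, launching, and lifecycle around it.
        System.out.println("Hello from a YARN container");
      }
    }

The Presto example later in this post uses the same machinery, with BundledJarRunnable as the runnable.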

Twill makes it super simple to integrate new technologies to run within YARN. A great example of this is Presto, an ad-hoc query framework, and in this blog, I’ll explain what it is and how we were able to make Presto run within YARN using Twill in a short period of time.

Why did we want to run Presto over Twill?

We wanted to add ad-hoc query capabilities to our flagship product, Continuuity Reactor. We looked at different frameworks and started experimenting with Presto because it is written in Java and is emerging as an important big data tool. The next question was how to integrate it. We opted to run Presto within YARN because it gives developers the flexibility to manage and monitor resources efficiently within a Hadoop cluster, along with the ability to run multiple Presto instances.

We use Twill extensively in Reactor for running all services within YARN. So, in order for us to run Presto within Reactor, we had to integrate it with Twill.

What is Presto?

Presto is a distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Last fall, Facebook open-sourced Presto, giving the world a viable, faster alternative to Hive, the data warehousing framework for Hadoop. Presto was designed for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of large organizations.

How does Presto work?

When a query is executed, it is sent to the coordinator through the command-line interface. The coordinator distributes the workload across the workers, and each worker reads and processes data for its portion of the input. The results are then sent from the workers back to the coordinator, which aggregates them into the full response to the query. Presto is much faster than Hive because it doesn’t need to run a new MapReduce job for every query; the workers stay running even when there are no active queries.

How did we integrate Presto with Twill?

First, we needed to run Presto services (discovery, coordinator, and worker) embedded in TwillRunnables, which posed a couple of challenges:

  • The public Presto distribution provides Bash scripts and Python scripts for running Presto services, but has no documented way to run in embedded mode.
  • Presto services normally use external configuration files for various properties like discovery URI and HTTP binding port.
  • Presto on Twill needed to handle varying discovery URIs, since YARN cannot guarantee that the discovery service runs on a particular host; any host can become unavailable.

To address these challenges, we configured the Presto services programmatically:

    Bootstrap app = new Bootstrap(modules.build());
    app.setRequiredConfigurationProperty("coordinator", "true");
    app.setRequiredConfigurationProperty("datasources", "jmx");
    app.setRequiredConfigurationProperty("discovery-server.enabled", "false");
    app.setRequiredConfigurationProperty("http-server.http.port", propHttpServerPort);
    app.setRequiredConfigurationProperty("discovery.uri", propDiscoveryUri);

Next, we needed to get the Presto services to use an existing Hive Metastore through the Hive connector so that the Presto CLI can run queries against Hive tables. While Presto includes basic documentation for file-based configuration of the Hive connector, there isn’t any documentation on how to do it programmatically. To tackle this, we inspected the code that loads the Hive connectors. We found that ConnectorManager.createConnection() sets up the connectors, but the ConnectorManager instance is a private field in CatalogManager, so we had to use reflection. While not ideal, it worked. In the future, we may contribute our source code to Presto to make it easier to embed in Java. The code we used to register the Hive connector is shown below:

    injector.getInstance(PluginManager.class).installPlugin(new HiveHadoop2Plugin());
    CatalogManager catalogManager = injector.getInstance(CatalogManager.class);
    Field connectorManagerField = CatalogManager.class.getDeclaredField("connectorManager");
    connectorManagerField.setAccessible(true);
    ConnectorManager connectorManager = (ConnectorManager) connectorManagerField.get(catalogManager);
    connectorManager.createConnection("hive", "hive-hadoop2", ImmutableMap.<String, String>builder()
        .put("hive.metastore.uri", propHiveMetastoreUri)
        .build());

Once we were able to run embedded Presto without Twill, we packaged Presto and all of its dependency jars into a single bundle jar to avoid dependency conflicts. Then we simply configured a Twill application to run instances of BundledJarRunnable that run the Presto services contained in the jar. Below is a full example of a Twill application that uses BundledJarRunnable to run Presto’s discovery service packaged inside a jar file:

    public class PrestoApplication implements TwillApplication {
    
      public static final String JAR_NAME = "presto-wrapper.jar";
      public static final File JAR_FILE = new File("presto-wrapper-1.0-SNAPSHOT.jar");
    
      @Override
      public TwillSpecification configure() {
        return TwillSpecification.Builder.with()
          .setName("PrestoApplication")
          .withRunnable()
          .add("Discovery", new BundledJarRunnable())
          .withLocalFiles().add(JAR_NAME, JAR_FILE.toURI(), false).apply()
          .anyOrder()
          .build();
      }
    
      public static void main(String[] args) {
        if (args.length < 1) {
          System.err.println("Arguments format: <host:port of zookeeper server>");
          System.exit(1);
        }
    
        String zkStr = args[0];
    
        final TwillRunnerService twillRunner =
          new YarnTwillRunnerService(
            new YarnConfiguration(), zkStr);
        twillRunner.startAndWait();
    
        // configure BundledJarRunnable
        BundledJarRunner.Arguments discoveryArgs = new BundledJarRunner.Arguments.Builder()
            .setJarFileName(JAR_NAME)
            .setLibFolder("lib")
            .setMainClassName("com.continuuity.presto.DiscoveryServer")
            .setMainArgs(new String[] { "--port", "8411" })
            .createArguments();
    
        // run Twill application
        final TwillController controller = twillRunner.prepare(new PrestoApplication())
            .withArguments("Discovery", discoveryArgs.toArray())
            .addLogHandler(new PrinterLogHandler(new PrintWriter(System.out, true)))
            .start();
    
        Runtime.getRuntime().addShutdownHook(new Thread() {
          @Override
          public void run() {
            controller.stopAndWait();
          }
        });
    
        try {
          Services.getCompletionFuture(controller).get();
        } catch (InterruptedException e) {
          e.printStackTrace();
        } catch (ExecutionException e) {
          e.printStackTrace();
        }
      }
    }
    

As you can see, once you can launch your application from Java code, Twill makes it straightforward to write a Twill application that runs that code inside YARN containers.

Adding new features to Twill

While getting Presto to run over Twill, we contributed a couple of new features to Twill that make it easier for anyone to build applications with similar needs: we added support for running Twill runnables within a clean classloader, and we are working on allowing users to deploy Twill runnables on unique hosts. In the future, we plan to open-source our Presto work so that anyone can spin up their own Presto services in YARN, and we are also considering support for Presto in Reactor to speed up ad-hoc queries.

Apache Twill is undergoing incubation at the Apache Software Foundation. Help us make it better by becoming a contributor.
