Continuuity Reactor 2.3: SQL and Security Release

Jul 23 2014, 10:22 am

Alex Baranau is a software engineer at Continuuity where he is responsible for building and designing software fueling the next generation of Big Data applications. Alex is a contributor to HBase and Flume, and has created several open-sourced projects. He also writes frequently about Big Data technologies.

The Continuuity Reactor platform is designed to make it easy for developers to build and manage data applications on Apache Hadoop™ and Apache HBase™. Every day we’re passionately focused on delivering an awesome experience for all developers, with or without Hadoop expertise. And today, we’re excited to release the next version of our platform, Continuuity Reactor 2.3.

In addition to continued stability, scalability, and performance, we have added a number of significant new features in Continuuity Reactor 2.3:

Ad-hoc SQL Queries

Procedures are an existing, programmatic way to access and query your data in Reactor, but sometimes you may want to explore a Dataset in an ad-hoc manner rather than writing procedural code. Reactor now supports ad-hoc SQL queries over Datasets via a new API that lets developers expose the schema of a Dataset and make it queryable through a REST API. SQL queries over Datasets are submitted via REST, executed by Apache Hive or another Hadoop-based SQL engine, and the results are retrieved the same way.
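
As a rough illustration of the flow (not a definitive reference), the sketch below submits a SQL query from Java over HTTP. The gateway port, the endpoint path, and the "purchases" Dataset name are assumptions made for this example; consult the REST API documentation for the exact routes and response format.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class AdHocQueryExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical Reactor gateway host, port, and query endpoint.
        URL url = new URL("http://localhost:10000/v2/data/queries");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // SQL over an exposed Dataset; the "purchases" Dataset is illustrative.
        String body = "{\"query\": \"SELECT customer, SUM(price) FROM purchases GROUP BY customer\"}";
        try (OutputStream out = conn.getOutputStream()) {
          out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        // A successful submission is expected to return a handle that can be
        // polled for status and results via further REST calls.
        System.out.println("HTTP status: " + conn.getResponseCode());
      }
    }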

Security Enhancements

We’re committed to making Hadoop applications secure. Continuuity Reactor now supports perimeter security, restricting access to resources only to authenticated users. With perimeter security, access to cluster nodes is restricted by a firewall. Cluster nodes can communicate with each other, but outside clients can only communicate with the cluster through a secured gateway.

With Reactor security enabled, the Reactor authentication server issues credentials (access tokens) to authenticated clients, and clients then send these credentials with their requests to Reactor. Calls that lack valid access tokens are rejected, limiting access to authenticated clients only. You can learn more about the authentication process on the Reactor Security page.
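
As a minimal sketch of what an authenticated client call might look like, the example below attaches an access token to a REST request. The header format, port, and endpoint shown here are assumptions for illustration; obtaining the token from the authentication server is described on the Reactor Security page and is omitted here.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class AuthenticatedRequestExample {
      public static void main(String[] args) throws Exception {
        // Access token previously issued by the Reactor authentication server;
        // read from the environment here as a placeholder.
        String accessToken = System.getenv("REACTOR_ACCESS_TOKEN");

        // Hypothetical secured gateway endpoint used for illustration only.
        URL url = new URL("http://localhost:10000/v2/apps");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Requests without a valid token are rejected by the secured gateway.
        conn.setRequestProperty("Authorization", "Bearer " + accessToken);
        System.out.println("HTTP status: " + conn.getResponseCode());
      }
    }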

Additional Release Highlights

Other key enhancements in 2.3 include new Application, Stream, Flow, and Dataset features such as:

  • Stream support for data retention policy; reconfigurable at runtime, while in use
  • Stream support for truncate via REST
  • Simplified Flowlet @Batch support with process methods no longer requiring an Iterator (see the sketch after this list)
  • New Datasets API that gives more power and flexibility when developing custom Datasets
  • Dataset management outside of Applications exposes REST interfaces to create, truncate, drop and discover Datasets
  • New Application API with an improved way to define application components
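
To make the simplified @Batch support concrete, here is a rough sketch of a Flowlet process method. The package names, annotations, and batch size reflect our reading of the Flow API and are illustrative rather than definitive.

    // Package names are assumed from the Reactor Flow API and may differ.
    import com.continuuity.api.annotation.Batch;
    import com.continuuity.api.annotation.ProcessInput;
    import com.continuuity.api.flow.flowlet.AbstractFlowlet;

    public class LogParserFlowlet extends AbstractFlowlet {

      // The runtime dequeues events in batches of up to 100, but the process
      // method now receives a single event at a time instead of an Iterator.
      @Batch(100)
      @ProcessInput
      public void process(String logLine) {
        // Application-specific parsing logic goes here.
        System.out.println("Parsed: " + logLine);
      }
    }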

Finally, we have added Reactor Services, an experimental feature that allows the addition of custom User Services that can be easily discovered from within Flows, Procedures and MapReduce jobs. We’ll have more services capabilities in our next release, but you can get an early preview of one of the features we are most excited about right now!

Try Reactor 2.3 Today

We are working hard to solve the challenging problems faced by both new and experienced data application developers and to enable a much more fun and productive development experience for Hadoop. Reactor unifies the capabilities you need when developing on Hadoop into an integrated developer experience so that you can focus on your application logic without the worries of distributed system architectures or scalability. Download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

We are excited about the latest release and would love to hear your thoughts. Please feel free to send us feedback at support@continuuity.com.


Meet Tephra, An Open Source Transaction Engine

Jul 18 2014, 8:00 am

Gary Helmling is a software engineer at Continuuity as well as an Apache HBase committer and Project Management Committee (PMC) member. Prior to Continuuity, Gary led the development of Hadoop and HBase applications at Twitter, TrendMicro, and Meetup.

Our platform, Continuuity Reactor, uses several open source technologies in the Apache Hadoop™ ecosystem to enable any developer to build data applications. One of the major components of our platform is Apache HBase, a non-relational, massively scalable column-oriented database modeled after Google’s BigTable. We use HBase for a number of reasons, including the strong data consistency it provides. One of the limitations of HBase as a standalone system, however, is that data updates are consistent only within a single region, or a set of contiguous rows, because it is very difficult to coordinate updates across these regions in a way that maintains scalability.

As a result, one of the tradeoffs is that HBase maintains consistency for a single row or region of rows, but anything that spans regions or tables cannot be updated atomically (i.e., with the entire transaction committed as one), nor can you do an atomic update that spans multiple remote procedure calls (RPCs). While we value what HBase provides, we believe that providing globally consistent transactions simplifies application development a great deal, allowing developers to focus more on the problems and use cases they care about rather than on implementing complex data access patterns.

This is why we built Tephra, a distributed, scalable transaction engine designed for HBase and Hadoop. Tephra can also be extended to integrate with other NoSQL systems like MongoDB and LevelDB as well as traditional relational databases and data warehouses. Tephra is a powerful data management tool that makes a wide range of use cases easier to solve, especially online and OLTP applications. It utilizes the key features of HBase to make transactional capabilities available without sacrificing overall performance.

Today we’re open sourcing Tephra for anyone to use because we believe that the broader developer community can benefit from it, and for anyone to contribute to because we have built Tephra with extensibility in mind.

How can developers use Tephra?

One common use case is secondary indexes. Developers typically create secondary indexes on HBase by writing updates to a second table with additional rows that reference the rows in the main table based on the index values. The problem is that there isn’t consistency in operations across the two tables, so they can get out of sync. Based on their actual data access patterns and what their application cares about, developers are forced to adopt more complicated application logic to manage the data and work around the inconsistencies. In contrast, Tephra simplifies this use case by allowing updates to both tables to be performed in a single globally consistent transaction.
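
As a sketch of how this looks with Tephra, the example below wraps a data table and its index table in a single transaction so the two writes commit or roll back together. The class and package names follow the Tephra client API as we understand it and may differ between versions; the table and column names are purely illustrative.

    // Package names reflect the Tephra and HBase client APIs of this era and
    // may differ between versions; treat this as a sketch, not a reference.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.continuuity.tephra.TransactionContext;
    import com.continuuity.tephra.TransactionSystemClient;
    import com.continuuity.tephra.hbase.TransactionAwareHTable;

    public class SecondaryIndexExample {

      public void writeWithIndex(TransactionSystemClient txClient) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Wrap both tables so their writes participate in one transaction.
        TransactionAwareHTable users =
            new TransactionAwareHTable(new HTable(conf, "users"));
        TransactionAwareHTable usersByEmail =
            new TransactionAwareHTable(new HTable(conf, "users_by_email"));

        TransactionContext context = new TransactionContext(txClient, users, usersByEmail);
        context.start();
        try {
          // The primary row and its secondary index entry are written together.
          Put user = new Put(Bytes.toBytes("user123"));
          user.add(Bytes.toBytes("d"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
          users.put(user);

          Put index = new Put(Bytes.toBytes("alice@example.com"));
          index.add(Bytes.toBytes("d"), Bytes.toBytes("user"), Bytes.toBytes("user123"));
          usersByEmail.put(index);

          // Either both writes become visible or neither does.
          context.finish();
        } catch (Exception e) {
          // Abort rolls back both writes, so the index never references a row
          // that was not committed.
          context.abort();
          throw e;
        }
      }
    }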

Why are we open sourcing Tephra?

Many developers and companies are successfully using HBase, but there are still gaps in its accessibility to developers. Tephra takes the strong foundation that HBase has given us to build upon and enhances it by making it more developer-friendly and broadening the potential users and use cases of HBase. We are open sourcing the technology because we want to give back to the community and believe Tephra will be useful to a broad range of developers.

We also are excited to see how others will use, apply, and extend Tephra transactions to their own applications, infrastructures, and environments. We recognize that developers have specific needs, some of which we haven’t anticipated, and we look forward to Tephra growing as a project and community.

Learn more and get involved

Check out the release notes or our slideshare for more details about Tephra. And please help us make the project better by joining our user and developer mailing list and contributing and reporting any issues, bugs, or ideas.


Behind the scenes: Hacking our way to success

Jul 7 2014, 8:47 pm

Sreevatsan Raman is a software engineer at Continuuity where he is building and architecting a platform fueling the next generation of Big Data applications. Prior to Continuuity, Sree designed and implemented big data infrastructure at Klout, Nominum, and Yahoo!

We just wrapped up our latest hackathon and it was a great reminder of the unique engineering culture we have at Continuuity. We have created a new application development platform, Continuuity Reactor, which is focused on allowing developers to quickly and easily build Big Data applications.

Building a platform that no one has created before is a big challenge. We break this huge effort into a continuous cadence of platform releases that are delivered to production frequently. Before every release we take a break from our daily efforts and hack on our platform for 48 hours where we stretch our imaginations and the platform capabilities we just built.

Every hackathon gives us an opportunity to dog-food our technology. We come together wearing our developer hats to build features and applications, incorporating our lessons learned into continually improving the developer experience, with the goal of making Hadoop more simple and accessible.

One of my favorite aspects of our hackathons is how the whole company comes together to build cool stuff and have fun. From our CEO to our engineering team to people in non-technical roles, everyone participates. Here are some thoughts and experiences about our company, culture, and hackathons from our awesome engineering interns:

Shu Das, University of Michigan

The unique aspect of Continuuity that I like is that everyone has a clear sense of his or her agenda and responsibilities, so we’re empowered to stay on top of our game. Not only do I have the resources I need and responsiveness from the rest of the team, but also the working environment at Continuuity is lively and enjoyable.

My first project was building an application on Reactor that visualizes data about the test cases we run on our code. This work gave me great insights into what our platform is, how to use it, and how our technology can be used for simplifying Hadoop. I really appreciate the fact that the feature I worked on is used daily, as a component of the development lifecycle, and not left off as a side project.

For the hackathon, I teamed up with Kenneth and Gourav (see below) to build a Reactor application that can be used to aggregate, correlate, and visualize data - for instance, metrics, logs, or any other events. It was amazing to see the application built in a very short amount of time using new core functionalities of the platform and dogfooding the new APIs, runtime, and documentation.

Gourav Khaneja, University of Illinois

The work here is interesting because the problems we’re solving are hard. One of my favorite aspects of Continuuity is the willingness of team members to help each other to work through challenges. For example, even during crunch time, every Continuuity member is willing to stop what he or she is doing to help out a fellow employee. I learn a lot from the team on a daily basis.

When I joined, I was tasked with optimizing resource allocation in YARN using Apache Twill. YARN has a large codebase and although my previous experience with a large code base was limited, I was able to come up to speed quickly with great mentorship from the team and contributed towards a major feature in Twill.

Kenneth Le, University of California, Berkeley

Interns are involved in relevant projects right away. While we receive guidance when needed, the focus of the internship program is more on empowering us to deliver and learning more via open communication about the various projects that other people are working on.

My first project was improving a developer tool that is used to deploy code to clusters. The existing tool took about 30 minutes to build and deploy the entire code base. The newer version, which I rewrote in Python, takes about 6 minutes, thus saving developers a lot of time in their development life-cycle.

Julien Guery, Ecole nationale supérieure des Télécommunications de Bretagne

This is an extremely technical company solving challenging problems. One of the first things I noticed is that the interns get to be part of the core engineering team and are involved in all aspects of the company.

In my first project, I learned a lot about Apache Hive and the Reactor platform while working on a feature to bring ad-hoc querying capabilities into our platform. I had great mentors who taught me how to test and debug and gave me insights into the architecture of the systems, and now I can dive right into new projects and teams without fear.

During the hackathon, I used our APIs to build a Python SDK. I wanted to showcase how Python developers can easily write Big Data applications using our platform, and my efforts during the hackathon demonstrated how this could be accomplished. The hack was well received, and an updated version of this SDK will be made available in a future release.


Our team is working to solve a difficult problem – making Hadoop a platform upon which data applications can be built by all developers. Whether at our hackathons or at our weekly company-wide demos, we are constantly sharing and collaborating so everyone can understand the impact that they have and the context of how their contributions map to the overall vision and mission of the company.

If you’re interested in learning more about our culture and career opportunities at Continuuity, check out http://continuuity.com/careers.


Hadoop Summit: Where is the value? Where are the apps?

Jun 24 2014, 8:00 am

Jonathan Gray, Founder & CEO of Continuuity, is an entrepreneur and software engineer with a background in open source and data. Prior to Continuuity, he was at Facebook working on projects like Facebook Messages. At the startup Streamy, Jonathan was an early adopter of Hadoop and an HBase committer.

Coming out of Hadoop Summit, one thing is clear to me – while there has been significant growth and success of the ecosystem, it is still early days and Hadoop is still exceptionally hard to consume for most organizations. As a result of this persistent issue, there weren’t many major announcements, nothing exceptionally new or different released, and the buzz remained largely centered on YARN and Spark, both of which are several years old.

While we saw reports of early-adopting companies seeing real value created with Hadoop, the focus was more technical this year than I anticipated: from the keynotes to the breakout sessions to the show floor, this year’s summit seemed more about the endless variety of technologies than about use cases and the actual return on investment being realized. A brief overview of a few other trends we observed is below:

Hadoop is not quite enterprise ready…yet

Hadoop Summit generated significant discussion about whether Hadoop is truly ready for real, production enterprise use. Of particular concern are security and the related issues of privacy and data policies that companies need, especially those dealing with customer or financial information. Recent acquisitions of Hadoop security upstarts by the major Hadoop distributions indicate that this will continue to be an important area of focus in the near term.

Hadoop vs. The EDW: To Replace or To Augment

Another hot topic is whether Hadoop is a replacement for the traditional EDW or whether it only serves to augment the EDW and offload certain workloads. In years past, this has been much more of a debate; however, this year it seems clear that most have accepted a symbiotic relationship for the time being. While I do expect this to change, it is evident today that there is a significant gap in the capabilities of the Hadoop stack compared to proprietary EDW technologies.

Hadoop is becoming more fragmented

This year it became apparent that the Hadoop ecosystem is splintering into multiple and often competing projects. Competing vendors are establishing parallel but increasingly separate stacks, while vendors that claim differentiation are marketing overlapping messages. There has been an explosion in the variety of ways to work with Hadoop and in the number of companies trying to make Hadoop consumable, and it’s becoming even more confusing to choose which path is best to follow. This is true not only for business leaders who are making decisions about Big Data projects in their companies but even for knowledgeable developers.

Hadoop (still) needs to be simplified

This mass confusion in the market is undercutting companies’ ability to achieve value and realize what they want from their Big Data initiatives. A lot of attention is still being paid to the infrastructure rather than the applications, so although the disruptive value of Big Data should be at the forefront, it remains elusive for most.

The Big Data Application revolution is still forthcoming. It is still early days, Hadoop is still very difficult, and very few people understand how to work with it. That’s why we are building a platform that focuses on making Hadoop easier for developers, allowing anyone to build applications (today in Java) without worrying about the low-level infrastructure. Rather than grapple with myriad technology options, they are free to focus on what matters – turning their brilliant ideas for data into apps that solve real problems. This is where Hadoop can produce desired outcomes – in data applications that quickly provide measurable value.

Adding Jet Fuel to the Fire

Not to be left out of the new choices in the Hadoop menagerie, in case you missed it, we announced a project in collaboration with AT&T Labs: a distributed framework for real-time data processing and analytics applications, codenamed jetStream. Available in open source in Q3 2014, you can find more information about this effort in our recent blog post and at jetStream.io.


Continuuity Loom 0.9.7: Extensible cluster management

Jun 10 2014, 11:12 am

Derek Wood is focused on DevOps at Continuuity where he is building tools to manage and operate the next generation of Big Data applications. Prior to Continuuity, Derek ran large scale distributed systems at Wells Fargo and at Yahoo!, where he was the senior engineering lead for the CORE content personalization platform.

In March, we open sourced Continuuity Loom, a system for templatizing and materializing complex multi-tiered application reference architectures in public or private clouds. It is designed bottom-up to support different facets of your organization - from developers, operations and system administrators to large service providers.

Since our first release, we have heard a lot of great things about how people are using Continuuity Loom, as well as requests for features that have been missing. After taking in all the feedback, we are excited to announce the next version of Continuuity Loom, codenamed Vela. The theme for this release is “Extensibility”: we have been working to make Continuuity Loom integrate with your standard workflows for provisioning and managing multi-tiered applications, and to make it easier to extend Continuuity Loom itself to your needs.

Highlights of Loom 0.9.7

  • Ability to register plugins to support surfacing of plugin defined fields for configuring providers and automators

  • Support for finer grained dependencies

  • Cluster life-cycle callback hooks for integrating with external systems like metering, metrics, etc.

  • Extend your cluster capabilities by adding new services to an existing cluster or reconfiguring existing services

  • Smart support for starting, stopping and restarting services with dependencies automatically included during service operations

  • Personalizable UI skins

  • More out-of-box Chef cookbooks including Apache Hive™ recipes and recipes for supporting Kerberos enabled clusters

Detailed overview of Loom 0.9.7 features

Cluster Reconfiguration

How many times have you had to change some configuration setting of a service on a cluster and had to remember to restart the right services in the right order for the change to take effect? The reality is, when making changes in multi-tiered application clusters, there are a lot of things to remember. Wouldn’t it be simpler if you could change the configuration that you want and let the system figure out everything else for you, ensuring the change you just made is active without any hassle? With this release of Continuuity Loom, you don’t have to worry any more about which services need to be restarted. Continuuity Loom automatically figures out the service stop and start order based on your service dependencies. You can find more information about how this is done here.

Add missing and manage cluster services easily

Let’s take a concrete use case: say your administrator has configured a template that allows you to create a Hadoop cluster (HDFS, YARN, HBase, ZooKeeper) with Kerberos security enabled. As a new user, you would like to try building a basic cluster with just HDFS and YARN until you are ready to add more. In that case, Continuuity Loom provides an easy way to remove services that you don’t need at creation time and then subsequently add them back to the live cluster with just a few clicks. With this new release, users now have the ability to stop, start, and restart services without having to worry about what additional or dependent services need to be restarted.

Plugin Registration

In line with our theme of extensibility, we wanted to ensure that developers are able to write custom Automator and Provider plugins. As part of this, plugins now define the fields they require, which get surfaced in the UI and passed to the plugin during task execution. Particularly for Provider plugins, this allows you to provide different options for provisioning clusters. For example, Rackspace requires a username and API key, while Joyent requires a username, key name, and private key file. It is now possible to write your own plugins and describe the fields they require. With the addition of this feature, you can also write support at the API level for any container (like Docker), OS, or cloud provider. You can learn more about this feature here.

Finer Grained Dependencies

Prior to this release, all service dependencies were applied at all phases of cluster creation. This created unnecessary solution-space exploration and execution of dependencies when they didn’t make sense. This release includes a feature called fine-grained dependency management for services. It allows Continuuity Loom administrators to specify required or optional service dependencies that apply at runtime or at install time (applied only when the service is installed and available on the machine). This is specified during service creation, so users don’t have to worry about it. It provides granular control over the deployment of services and opens up support for HA Hadoop clusters and secure Hadoop clusters, which require external Kerberos Key Distribution Centers (KDCs). You can learn more about this feature here.

Cluster life-cycle callback hooks

Oftentimes, you need the ability to integrate Continuuity Loom with other workflows that exist in your organization. This feature allows the Continuuity Loom administrator to configure a callback class to insert custom logic before and after cluster operations. Out of the box, Continuuity Loom provides an HTTP callback implementation of the ‘ClusterCallback’ interface. You can use this feature to integrate the cluster life-cycle with monitoring systems, metering systems, and even your favorite chat application, sending alerts when clusters are created or deleted, or when any operation fails. You can learn more about this feature here.

Personalized UI skins

When you install Continuuity Loom on-premises, wouldn’t you like the ability to change the color scheme and logo so it fits well with the other tools in your organization? With this release, you have the ability to change the color, skin, and logo of your Continuuity Loom installation.

This has been an exciting release for us. Check out the Release Notes for more details about this release. Give it a spin by downloading the standalone version for your laptop and visiting the quickstart guide to spin up clusters in the cloud.

Help us make Continuuity Loom better by contributing and please report any issues, bugs, or ideas.

Coming soon - Continuuity Loom for free in the Cloud

Be sure to sign up at tryloom.io to be among the first to know when the free, cloud-based version of Continuuity Loom is available.


HBaseCon: Moving Beyond the Core to Address Availability & Usability

May 19 2014, 12:58 pm

Jonathan Gray, CEO & Co-founder of Continuuity, is an entrepreneur and software engineer with a background in open source and data. Prior to Continuuity, he was at Facebook working on projects like Facebook Messages. At the startup Streamy, Jonathan was an early adopter of Hadoop and an HBase committer.

We just wrapped HBaseCon 2014, the annual event for Apache HBase™ contributors, developers, and users. As in years past, this is one of the most technical conferences that we attend, and it’s really focused on the core community of developers who are doing something meaningful with the enabling technology. What makes HBaseCon so compelling is that it’s not theoretical but rather all about overcoming real technical challenges and actual business use cases. And this year, we noticed a couple of key trends that are shaping the future of HBase.

Overall, we noticed that the HBase discussion has moved up a level, and this is a good thing. We’re no longer talking about the core architecture of HBase, which is pretty much set at this point. People aren’t talking about redoing the architecture; instead, it’s all about building on top of what’s already there. Last year’s event was very focused on improvements to the core platform, such as detecting and recovering from server failures more quickly, and on new use cases launching on HBase. In the year since, HBase has further stabilized into a mature platform, and those new use cases are now established production systems. Now the conversation is about building above and around HBase for higher availability and usability.

There was a lot of good discussion of increasing availability from an HBase standpoint. In the Facebook keynote on HydraBase, they discussed using a consensus protocol for HBase reads and writes in order to tolerate individual server failures without sacrificing availability or strong consistency. Similarly, Hortonworks and others shared work they’ve been doing on timeline consistent read replicas. For example, if a single server goes down you can still read data consistently up to a given point in time—the most updated snapshot of the data. Google’s Bigtable team also touched on availability by addressing their approach to the long tail of latency.

Multiple approaches to availability are happening, but they ultimately lead to the same goals of trying to reduce the big latency outliers and getting to 5-9s (i.e., 99.999%) reliability. In addition to early adopters like Facebook, Cloudera, and Hortonworks, we’re also encouraged to see a lot of other real users step up and take an active role in the community—we’ve seen this particularly in contributions from Salesforce, Xiaomi, and Bloomberg.

All of these companies are using HBase at very large scale, contributing to its development to continue to move it forward, and then sharing their successes with others. For us at Continuuity, HBase usability is what we’re driving at, and we’ll remain very focused on improving usability so that more developers can build their own HBase and Hadoop applications. This is where HBase is going, and we’re excited to be a part of this community and contribute to its success.


Running Presto over Apache Twill

Apr 3 2014, 11:35 am

Alvin Wang is a software engineer at Continuuity where he is building software fueling the next generation of Big Data applications. Prior to Continuuity, Alvin developed real-time processing systems at Electronic Arts and completed engineering internships at Facebook, AppliedMicro, and Cisco Systems.

We open-sourced Apache Twill with the goal of enabling developers to easily harness the power of YARN using a simple programming framework and reusable components for building distributed applications. Twill hides the complexity of YARN behind a programming model that is similar to running threads. Instead of making you write the same boilerplate code for every application, Twill provides a simple and intuitive API for building an application over YARN.
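
To illustrate the thread-like model, here is a minimal sketch using Twill’s public API: a runnable that prints a message, launched into a YARN container much as a Runnable is handed to a Thread. The ZooKeeper connection string is passed as a program argument.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.twill.api.AbstractTwillRunnable;
    import org.apache.twill.api.TwillController;
    import org.apache.twill.api.TwillRunnerService;
    import org.apache.twill.yarn.YarnTwillRunnerService;

    public class HelloWorldRunnable extends AbstractTwillRunnable {

      @Override
      public void run() {
        // This body executes inside a YARN container, not in the local JVM.
        System.out.println("Hello from a YARN container");
      }

      public static void main(String[] args) {
        String zkConnectStr = args[0];  // ZooKeeper quorum used by Twill

        TwillRunnerService runner =
            new YarnTwillRunnerService(new YarnConfiguration(), zkConnectStr);
        runner.startAndWait();

        // Launching the runnable looks much like starting a thread. A real
        // program would block on completion or register a shutdown hook, as
        // shown in the Presto example later in this post.
        TwillController controller = runner.prepare(new HelloWorldRunnable()).start();
      }
    }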

Twill makes it super simple to integrate new technologies to run within YARN. A great example of this is Presto, an ad-hoc query framework, and in this blog, I’ll explain what it is and how we were able to make Presto run within YARN using Twill in a short period of time.

Why did we want to run Presto over Twill?

We wanted to add ad-hoc query capabilities to our flagship product, Continuuity Reactor. We looked at different frameworks and started experimenting with Presto because it is written in Java and is emerging as an important big data tool. The next question was how to integrate it. We opted to run Presto within YARN because it gives developers the flexibility to manage and monitor resources efficiently within a Hadoop cluster, along with the capability to run multiple Presto instances.

We use Twill extensively in Reactor for running all services within YARN. So, in order for us to run Presto within Reactor, we had to integrate it with Twill.

What is Presto?

Presto is a distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Last fall, Facebook open-sourced Presto, giving the world a viable, faster alternative to Hive, the data warehousing framework for Hadoop. Presto was designed for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of large organizations.

How does Presto work?

A query is submitted to the coordinator, for example through the command-line interface. The coordinator distributes the workload across workers, and each worker reads and processes its portion of the input. The results are then sent from the workers back to the coordinator, which aggregates them into a full response to the query. Presto is much faster than Hive because it doesn’t need to launch a new MapReduce job for every query; the workers can be left running even when there are no active queries.

How did we integrate Presto with Twill?

First, we needed to run Presto services (discovery, coordinator, and worker) embedded in TwillRunnables, which posed a couple of challenges:

  • The public Presto distribution provides Bash scripts and Python scripts for running Presto services, but has no documented way to run in embedded mode.
  • Presto services normally use external configuration files for various properties like discovery URI and HTTP binding port.
  • Presto on Twill needed to handle varying discovery URIs, since YARN cannot guarantee that the discovery service will run on any particular host (any host could become unavailable).

To address these challenges, we configured the Presto services programmatically:

    Bootstrap app = new Bootstrap(modules.build());
    app.setRequiredConfigurationProperty("coordinator", "true");
    app.setRequiredConfigurationProperty("datasources", "jmx");
    app.setRequiredConfigurationProperty("discovery-server.enabled", "false");
    app.setRequiredConfigurationProperty("http-server.http.port", propHttpServerPort);
    app.setRequiredConfigurationProperty("discovery.uri", propDiscoveryUri);
    

Next, we needed to get the Presto services to use an existing Hive metastore with the Hive connector so that the Presto CLI can run queries against Hive tables. While Presto includes basic documentation for file-based configuration of the Hive connector, there isn’t any documentation on how to do it programmatically. To tackle this, we inspected the code that loads the Hive connectors. We found that ConnectorManager.createConnection() was setting up the connectors, but the ConnectorManager instance was a private field in CatalogManager, so we had to use reflection. While not ideal, it worked. In the future, we may contribute our source code to Presto to make it easier to embed in Java. The code we used to register the Hive connector is shown below:

    // Install the Hive plugin, then use reflection to reach the private
    // connectorManager field of CatalogManager and register the Hive catalog.
    injector.getInstance(PluginManager.class).installPlugin(new HiveHadoop2Plugin());
    CatalogManager catalogManager = injector.getInstance(CatalogManager.class);
    Field connectorManagerField = CatalogManager.class.getDeclaredField("connectorManager");
    connectorManagerField.setAccessible(true);
    ConnectorManager connectorManager = (ConnectorManager) connectorManagerField.get(catalogManager);
    connectorManager.createConnection("hive", "hive-hadoop2", ImmutableMap.<String, String>builder()
        .put("hive.metastore.uri", propHiveMetastoreUri)
        .build());
    

Once we were able to run embedded Presto without Twill, we packaged Presto with all its dependency jars into a bundle jar file to avoid dependency conflicts. Then we simply configured a Twill application to run instances of BundledJarRunnable that run the Presto services contained within the jar file. Below is a full example of a Twill application that uses BundledJarRunnable to run Presto’s discovery service packaged within a jar file:

    public class PrestoApplication implements TwillApplication {
    
      public static final String JAR_NAME = "presto-wrapper.jar";
      public static final File JAR_FILE = new File("presto-wrapper-1.0-SNAPSHOT.jar");
    
      @Override
      public TwillSpecification configure() {
        return TwillSpecification.Builder.with()
          .setName("PrestoApplication")
          .withRunnable()
          .add("Discovery", new BundledJarRunnable())
          .withLocalFiles().add(JAR_NAME, JAR_FILE.toURI(), false).apply()
          .anyOrder()
          .build();
      }
    
      public static void main(String[] args) {
        if (args.length < 1) {
          // Print usage and exit if the ZooKeeper connection string is missing.
          System.err.println("Usage: PrestoApplication <host:port of zookeeper server>");
          System.exit(1);
        }
    
        String zkStr = args[0];
    
        final TwillRunnerService twillRunner =
          new YarnTwillRunnerService(
            new YarnConfiguration(), zkStr);
        twillRunner.startAndWait();
    
        // configure BundledJarRunnable
        BundledJarRunner.Arguments discoveryArgs = new BundledJarRunner.Arguments.Builder()
            .setJarFileName(JAR_NAME)
            .setLibFolder("lib")
            .setMainClassName("com.continuuity.presto.DiscoveryServer")
            .setMainArgs(new String[] { "--port", "8411" })
            .createArguments();
    
        // run Twill application
        final TwillController controller = twillRunner.prepare(new PrestoApplication())
            .withArguments("Discovery", discoveryArgs.toArray())
            .addLogHandler(new PrinterLogHandler(new PrintWriter(System.out, true)))
            .start();
    
        Runtime.getRuntime().addShutdownHook(new Thread() {
          @Override
          public void run() {
            controller.stopAndWait();
          }
        });
    
        try {
          Services.getCompletionFuture(controller).get();
        } catch (InterruptedException e) {
          e.printStackTrace();
        } catch (ExecutionException e) {
          e.printStackTrace();
        }
      }
    }
    

As you can see, once you have your application running from Java code, Twill makes it straightforward to write a Twill application that runs your code inside a YARN container.

Adding new features to Twill

During the process of getting Presto to run over Twill, we contributed a couple of new features to Twill to make it easier for anyone to implement applications with similar needs: we added support for running Twill runnables within a clean classloader, and we are currently working on allowing users to deploy Twill runnables on unique hosts. In the future, we plan to open-source our Presto work so that anyone can spin up their own Presto services in YARN, and we are also considering support for Presto in Reactor to speed up ad-hoc queries.

Apache Twill is undergoing incubation at the Apache Software Foundation. Help us make it better by becoming a contributor.
