How we built it: Making Hadoop data exploration easier with Ad-hoc SQL Queries

Aug 28 2014, 10:05 am

Poorna Chandra is a software engineer at Continuuity where he is responsible for building software fueling the next generation of data applications. Prior to Continuuity, he developed big data infrastructure at Greenplum and Yahoo!
Julien Guery is a software engineering intern at Continuuity and a Master's student at the French engineering school Télécom Bretagne.

We are excited to introduce a new feature added in the latest 2.3 release of Continuuity Reactor: ad-hoc querying of Datasets. Datasets are high-level abstractions over common data patterns. Reactor Datasets provide a consistent view of your data whether it is processed in batch or in real time. In addition to scans and RPCs over Datasets, there is also a need to explore data through an easy-to-use SQL query interface. This lets you run ad-hoc interactive queries on Datasets, generate reports from them, and integrate Reactor with the BI tool of your choice. In this post, we will talk about how this works and how we expose a simple interface for querying data.

In keeping with our mission of providing simple access to powerful technology, we set out to enable ad-hoc querying of Datasets without imposing extra overhead on the user. We accomplished this by integrating the underlying technologies that enable declarative querying into Reactor. One important design goal was to introduce this feature without requiring our users to manage additional services or frameworks outside of Reactor. Below we will share how we built this system.

Hive as the SQL engine

SQL is a widely adopted standard for ad-hoc querying and a natural, familiar way for users to interact with Datasets. While there are a number of existing SQL engines in the Hadoop ecosystem, we decided to go with Apache Hive because one of its core components, the Hive Metastore (which stores metadata about Hive tables), is also used by other SQL engines such as Presto and Impala. This will make later integrations of those engines easier.

Datasets are non-native, external tables from Hive’s point of view. Hive has a well-defined interface for interacting with non-native tables via Storage Handlers. To integrate Hive with Datasets, we implemented a custom Dataset Storage Handler to manage these interactions; a simplified sketch of the wiring follows the list below. The Dataset Storage Handler is composed of:

  • a DatasetInputFormat that knows how to retrieve Dataset Records from Datasets, and
  • a DatasetSerDe that converts the records into objects that Hive can work with.
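
Wiring these two pieces into Hive follows Hive's standard storage handler contract. The snippet below is only an illustrative sketch, assuming a 2014-era Hive API (the relevant signatures have changed in later Hive versions); it shows how such a handler would point Hive at the DatasetInputFormat and DatasetSerDe, whose implementations are omitted here, and is not the actual Reactor code.

    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.mapred.InputFormat;

    // Illustrative sketch only. DatasetInputFormat and DatasetSerDe are the
    // custom classes described above; their bodies are omitted here.
    public class DatasetStorageHandler extends DefaultStorageHandler {

      @Override
      public Class<? extends InputFormat> getInputFormatClass() {
        // Tells Hive how to create splits and read records out of a Dataset.
        return DatasetInputFormat.class;
      }

      @Override
      public Class<? extends SerDe> getSerDeClass() {
        // Tells Hive how to turn those records into rows and columns it understands.
        return DatasetSerDe.class;
      }
    }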

The data flow between Hive and Datasets is depicted below:

Hive as a Service in Reactor

Typically, a Hive installation consists of a Hive Metastore server that contains the metadata and a Hive Server that runs the user queries. In our case, we can connect to an existing Hive Metastore for metadata. However, connecting to an existing Hive server has the following limitations:

  • Hive Server does not expose interfaces that give access to the complete query life cycle, which limits our ability to manage transactions when working with Datasets.
  • Hive Server runs user code in the server process itself, which makes the service vulnerable to misbehaving code.
  • Hive Server’s HTTP interface does not support security and multi-tenancy.

To overcome these limitations, we created a new HTTP-based Hive Server that addresses the above issues by wrapping Hive’s CLIService and integrating it with Reactor Datasets and our transaction engine, Tephra. We chose to run this Hive Server in YARN containers for proper user-code and resource isolation, as well as scalability.
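
To make the idea of wrapping CLIService more concrete, here is a minimal sketch of driving a single statement through Hive's embedded CLIService from our own service code instead of the stock HiveServer2 endpoint. It assumes a 2014-era (Hive 0.12/0.13) API and an already-initialized CLIService instance; exact signatures vary across Hive versions, and the real service additionally ties Tephra transactions into each step of this life cycle.

    import java.util.Collections;

    import org.apache.hive.service.cli.CLIService;
    import org.apache.hive.service.cli.OperationHandle;
    import org.apache.hive.service.cli.RowSet;
    import org.apache.hive.service.cli.SessionHandle;

    // Minimal sketch: run one statement through an embedded CLIService.
    // Assumes cliService was created and started elsewhere.
    public final class EmbeddedHiveQueryRunner {

      private final CLIService cliService;

      public EmbeddedHiveQueryRunner(CLIService cliService) {
        this.cliService = cliService;
      }

      public RowSet run(String statement) throws Exception {
        SessionHandle session =
            cliService.openSession("reactor", "", Collections.<String, String>emptyMap());
        try {
          // Because we own the full life cycle here (not just an opaque JDBC call),
          // a transaction can be started before execution and committed or
          // invalidated after the results are fetched.
          OperationHandle op =
              cliService.executeStatement(session, statement, Collections.<String, String>emptyMap());
          try {
            return cliService.fetchResults(op);
          } finally {
            cliService.closeOperation(op);
          }
        } finally {
          cliService.closeSession(session);
        }
      }
    }

Owning the session and operation handles directly is what gives us the hooks needed for transaction management around each query.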

Hive-Reactor integration challenges

Since Hive jars bundle disparate libraries such as Protocol Buffers and Guava, and Reactor depends on different versions of the same libraries, we ran into recurring classloading conflicts. We had to make sure that the classpaths of the different components that run Hive in Reactor listed their jars in a particular order. We fixed the classpath order in the following places (a small configuration sketch follows the list):

  • In Reactor Master: since we only launch Hive service from Reactor Master, it was sufficient to isolate Hive jars in a separate classloader to avoid any library version conflicts.
  • In the YARN container running Hive service: we set the classpath of the container in the right order during its initialization since we have control over it.
  • In the MapReduce jobs launched by Hive in the YARN cluster: we set the Hadoop configuration setting mapreduce.job.user.classpath.first to true so that the Reactor jars listed in the hive.aux.jars.path configuration come earlier in the classpath of those MapReduce jobs than the Hive jars.
  • In the local MapReduce jobs launched in the same YARN container as the Hive service: we changed how the Hadoop classpath was built by modifying the HADOOP_CLASSPATH environment variable, and we set HADOOP_USER_CLASSPATH_FIRST to true so that the HADOOP_CLASSPATH content comes earlier in the classpath of those local jobs than the Hive jars.
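
For reference, the two MapReduce-related fixes come down to a standard Hadoop property and a standard environment variable. The sketch below only illustrates those two knobs; it is not the actual Reactor Master code, and the class and method names are made up for illustration.

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;

    // Simplified sketch of the two classpath-ordering knobs described above.
    public final class ClasspathOrdering {

      // For MapReduce jobs that Hive submits to the YARN cluster: make user/aux
      // jars (which include the Reactor jars) win over Hive's bundled libraries.
      static void preferUserJars(Configuration conf) {
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
      }

      // For local MapReduce jobs running inside the Hive service container:
      // prepend our jars via HADOOP_CLASSPATH and tell the Hadoop scripts to
      // honor that ordering.
      static void preferUserJars(Map<String, String> containerEnv, String reactorJars) {
        containerEnv.put("HADOOP_CLASSPATH", reactorJars);
        containerEnv.put("HADOOP_USER_CLASSPATH_FIRST", "true");
      }
    }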

Future work

We are working hard to enable a more fun and productive user experience for data exploration in Hadoop. One of the features that we plan on introducing in the future is a JDBC driver to connect third-party BI tools to Reactor Datasets. This will make our platform even more accessible to all users who want to work with data in Hadoop.

To try out our latest new features including Ad-hoc SQL queries, download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

And if this work sounds exciting to you, check out our careers page and send us your resume!


Continuuity Reactor 2.3: SQL and Security Release

Jul 23 2014, 10:22 am

Alex Baranau is a software engineer at Continuuity where he is responsible for designing and building software fueling the next generation of Big Data applications. Alex is a contributor to HBase and Flume, has created several open-source projects, and writes frequently about Big Data technologies.

The Continuuity Reactor platform is designed to make it easy for developers to build and manage data applications on Apache Hadoop™ and Apache HBase™. Every day we’re passionately focused on delivering an awesome experience for all developers, with or without Hadoop expertise. And today, we’re excited to release the next version of our platform, Continuuity Reactor 2.3.

In addition to continued stability, scalability, and performance, we have added a number of significant new features in Continuuity Reactor 2.3:

Ad-hoc SQL Queries

Procedures are an existing, programmatic way to access and query your data in Reactor, but sometimes you may want to explore a Dataset in an ad-hoc manner rather than write procedural code. Reactor now supports ad-hoc SQL queries over Datasets via a new API that allows developers to expose the schema of a Dataset and make it queryable through a REST API. SQL queries over Datasets are submitted via REST, executed by Apache Hive or other Hadoop-based SQL engines, and their results are retrieved through the same REST interface.
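
As a rough illustration of that flow, the hypothetical client below submits a SQL statement over HTTP and prints the response. The endpoint path, port, and JSON shape are placeholders, not the documented Reactor REST API; see the developer documentation for the real interface.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    // Hypothetical illustration only: the endpoint path and JSON shape below are
    // placeholders, not the documented Reactor REST API.
    public final class AdHocQueryClient {

      public static void main(String[] args) throws Exception {
        URL url = new URL("http://reactor-host:10000/v2/data/queries");  // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");

        // Submit a SQL statement against a Dataset that has exposed its schema.
        String body = "{\"query\": \"SELECT * FROM my_dataset LIMIT 10\"}";
        try (OutputStream out = conn.getOutputStream()) {
          out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("HTTP status: " + conn.getResponseCode());
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);  // query handle or results, depending on the API
          }
        }
      }
    }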

Security Enhancements

We’re committed to making Hadoop applications secure. Continuuity Reactor now supports perimeter security, restricting access to resources only to authenticated users. With perimeter security, access to cluster nodes is restricted by a firewall. Cluster nodes can communicate with each other, but outside clients can only communicate with the cluster through a secured gateway.

With Reactor security enabled, the Reactor authentication server issues credentials (access tokens) to authenticated clients, and clients then send these credentials with their requests to Reactor. Calls that lack valid access tokens are rejected, limiting access to authenticated clients only. You can learn more about the authentication process on the Reactor Security page.
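
To illustrate the client side of that exchange, here is a minimal, hypothetical sketch in which a client that has already obtained an access token attaches it to a request. The bearer-style header shown is an assumption; the Reactor Security page has the authoritative details.

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustrative sketch only: the token format and header scheme are assumptions;
    // see the Reactor Security documentation for the authoritative details.
    public final class AuthenticatedCall {

      public static int callWithToken(String urlString, String accessToken) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        // Assumption: the access token is attached as a bearer-style Authorization header.
        conn.setRequestProperty("Authorization", "Bearer " + accessToken);
        // A 401 response would indicate a missing, expired, or invalid token.
        return conn.getResponseCode();
      }
    }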

Additional Release Highlights

Other key enhancements in 2.3 include new Application, Stream, Flow, and Dataset features such as:

  • Stream support for a data retention policy that can be reconfigured at runtime, while the Stream is in use
  • Stream support for truncate via REST
  • Simplified Flowlet @Batch support with process methods no longer requiring an Iterator
  • New Datasets API that gives more power and flexibility when developing custom Datasets
  • Dataset management outside of Applications, with REST interfaces to create, truncate, drop, and discover Datasets
  • New Application API with an improved way to define application components

Finally, we have added Reactor Services, an experimental feature that allows the addition of custom User Services that can be easily discovered from within Flows, Procedures and MapReduce jobs. We’ll have more services capabilities in our next release, but you can get an early preview of one of the features we are most excited about right now!

Try Reactor 2.3 Today

We are working hard to solve the challenging problems faced by both new and experienced data application developers and to enable a much more fun and productive development experience for Hadoop. Reactor unifies the capabilities you need when developing on Hadoop into an integrated developer experience so that you can focus on your application logic without the worries of distributed system architectures or scalability. Download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

We are excited about the latest release and would love to hear your thoughts. Please feel free to send us feedback at support@continuuity.com.


Meet Tephra, An Open Source Transaction Engine

Jul 18 2014, 8:00 am

Gary Helmling is a software engineer at Continuuity as well as an Apache HBase committer and Project Management Committee (PMC) member. Prior to Continuuity, Gary led the development of Hadoop and HBase applications at Twitter, TrendMicro, and Meetup.

Our platform, Continuuity Reactor, uses several open source technologies in the Apache Hadoop™ ecosystem to enable any developer to build data applications. One of the major components of our platform is Apache HBase, a non-relational, massively scalable column-oriented database modeled after Google’s BigTable. We use HBase for a number of reasons, including the strong data consistency it provides. One of the limitations of HBase as a standalone system, however, is that data updates are consistent only within a single region, or a set of contiguous rows, because it is very difficult to coordinate updates across these regions in a way that maintains scalability.

As a result, one of the tradeoffs is that HBase maintains consistency within a single row or region of rows, but anything that spans regions or tables cannot be updated atomically, i.e., with the entire transaction committed as one; nor can you perform an atomic update that spans multiple remote procedure calls (RPCs). While we value what HBase provides, we believe that globally consistent transactions simplify application development a great deal, allowing developers to focus more on the problems and use cases they care about rather than on implementing complex data access patterns.

This is why we built Tephra, a distributed, scalable transaction engine designed for HBase and Hadoop. Tephra can also be extended to integrate with other NoSQL systems like MongoDB and LevelDB as well as traditional relational databases and data warehouses. Tephra is a powerful data management tool that makes a wide range of use cases easier to solve, especially online and OLTP applications. It utilizes the key features of HBase to make transactional capabilities available without sacrificing overall performance.

Today we’re open sourcing Tephra for anyone to use because we believe that the broader developer community can benefit from it, and for anyone to contribute to because we have built Tephra with extensibility in mind.

How can developers use Tephra?

One common use case is secondary indexes. Developers typically build secondary indexes on HBase by writing updates to a second table whose rows reference the rows in the main table, keyed by the index values. The problem is that operations across the two tables are not consistent, so the tables can get out of sync. Developers are then forced to add more complicated application logic to manage the data and work around the inconsistencies, depending on their access patterns and what their application can tolerate. In contrast, Tephra simplifies this use case by allowing updates to both tables to be performed in a single, globally consistent transaction.
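
To make that concrete, below is a minimal sketch modeled on Tephra's published examples: both the main table and the index table are wrapped as transaction-aware tables and updated inside a single transaction, so either both writes become visible or neither does. The package names have moved across Tephra releases, so the imports are indicative rather than exact.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Indicative imports: Tephra's package has moved over time
    // (com.continuuity.* originally, later co.cask.tephra.*, now org.apache.tephra.*).
    import org.apache.tephra.TransactionContext;
    import org.apache.tephra.TransactionSystemClient;
    import org.apache.tephra.hbase.TransactionAwareHTable;

    public final class SecondaryIndexExample {

      private static final byte[] FAMILY = Bytes.toBytes("d");

      // Writes a user row and its secondary-index row in one transaction,
      // so either both writes become visible or neither does.
      public static void indexUser(TransactionSystemClient txClient, byte[] userId, byte[] email)
          throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TransactionAwareHTable users =
            new TransactionAwareHTable(new HTable(conf, TableName.valueOf("users")));
        TransactionAwareHTable emailIndex =
            new TransactionAwareHTable(new HTable(conf, TableName.valueOf("users_by_email")));

        TransactionContext txContext = new TransactionContext(txClient, users, emailIndex);
        txContext.start();
        try {
          // Main table: row key = user id, column holds the email.
          users.put(new Put(userId).addColumn(FAMILY, Bytes.toBytes("email"), email));
          // Index table: row key = email, column points back to the user id.
          emailIndex.put(new Put(email).addColumn(FAMILY, Bytes.toBytes("user"), userId));
          // Commit: both writes become visible atomically.
          txContext.finish();
        } catch (Exception e) {
          // Roll back so that neither write becomes visible.
          txContext.abort();
          throw e;
        }
      }
    }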

Why are we open sourcing Tephra?

Many developers and companies are successfully using HBase, but there are still gaps in its accessibility to developers. Tephra takes the strong foundation that HBase has given us to build upon and enhances it by making it more developer-friendly and broadening the potential users and use cases of HBase. We are open sourcing the technology because we want to give back to the community and believe Tephra will be useful to a broad range of developers.

We are also excited to see how others will use, apply, and extend Tephra transactions in their own applications, infrastructures, and environments. We recognize that developers have specific needs, some of which we haven’t anticipated, and we look forward to Tephra growing as a project and a community.

Learn more and get involved

Check out the release notes or our SlideShare presentation for more details about Tephra. And please help us make the project better by joining our user and developer mailing lists, contributing code, and reporting any issues, bugs, or ideas.
