How we built it: Making Hadoop data exploration easier with Ad-hoc SQL Queries

Aug 28 2014, 10:05 am

Poorna Chandra is a software engineer at Continuuity where he is responsible for building software fueling the next generation of data applications. Prior to Continuuity, he developed big data infrastructure at Greenplum and Yahoo!
Julien Guery is a software engineering intern at Continuuity and a Master's student at the French engineering school Télécom Bretagne.

We are excited to introduce a new feature added in the latest 2.3 release of Continuuity Reactor: ad-hoc querying of Datasets. Datasets are high-level abstractions over common data patterns, and Reactor Datasets provide a consistent view of your data whether it is processed in batch or in real time. Beyond accessing Datasets through scans and RPCs, there is also a need to explore data through an easy-to-use SQL query interface, which lets you run ad-hoc interactive queries on Datasets, generate reports on Datasets, and integrate Reactor with the BI tool of your choice. In this blog, we will talk about how this works and how we expose an easy-to-use interface for querying data.

In keeping with our mission of providing simple access to powerful technology, we set out to enable ad-hoc querying of Datasets without adding overhead for the user. We accomplished this by integrating the underlying technologies that enable declarative querying directly into Reactor. One of our most important design goals was to introduce this feature without requiring our users to manage additional services or frameworks outside of Reactor. Below we share how we built this system.

Hive as the SQL engine

SQL is a widely adopted standard for ad-hoc querying and a natural way for users to interact with Datasets. While there are a number of SQL engines in the Hadoop ecosystem, we went with Apache Hive because one of its core components, the Hive Metastore (which stores metadata about Hive tables), is also used by other SQL engines such as Presto and Impala, which will make later integrations with those engines easier.

Datasets are non-native, external tables from Hive’s point of view. Hive has a well-defined interface for interacting with non-native tables via Storage Handlers. To integrate Hive with Datasets, we implemented a custom Dataset Storage Handler to manage the interactions. The Dataset Storage Handler is composed of:

  • a DatasetInputFormat that knows how to retrieve Dataset Records from Datasets, and
  • a DatasetSerDe that converts the records into objects that Hive can work with.
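
To make this composition concrete, here is a minimal sketch of a storage handler against the Hive 0.x-era interfaces, assuming the DatasetInputFormat and DatasetSerDe classes described above (their bodies are omitted, and the real Reactor implementation also wires in Dataset instantiation and transactions, so treat this purely as an illustration):

```java
import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;

// Sketch only: Hive asks the storage handler which InputFormat and SerDe
// to use for a non-native table backed by a Dataset.
public class DatasetStorageHandler extends DefaultStorageHandler {

  @Override
  public Class<? extends InputFormat> getInputFormatClass() {
    return DatasetInputFormat.class;  // retrieves Records from the Dataset
  }

  @Override
  public Class<? extends SerDe> getSerDeClass() {
    return DatasetSerDe.class;        // converts Records into Hive objects
  }
}
```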

Figure: the data flow between Hive and Datasets.

Hive as a Service in Reactor

Typically, a Hive installation consists of a Hive Metastore server that contains the metadata and a Hive Server that runs the user queries. In our case, we can connect to an existing Hive Metastore for metadata. However, connecting to an existing Hive server has the following limitations:

  • Hive Server does not have interfaces that allow access to the complete query life cycle. This limits transaction management while dealing with Datasets.
  • Hive Server runs user code inside the server process, which makes the service vulnerable to bad code.
  • Hive Server’s HTTP interface does not support security and multi-tenancy.

To overcome these limitations, we created a new HTTP-based Hive Server that addresses the above issues by wrapping Hive’s CLIService and integrating it with Reactor Datasets and our transaction engine, Tephra. We chose to run this Hive Server in YARN containers for proper isolation of user code and resources, as well as for scalability.
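
The value of owning the query life cycle is easiest to see in code. The following is a rough sketch, not the actual Reactor implementation: the wrapper name and transaction hooks are illustrative, construction of the CLIService and the Tephra client is omitted, and only the happy path is shown.

```java
import java.util.Map;
import org.apache.hive.service.cli.CLIService;
import org.apache.hive.service.cli.OperationHandle;
import org.apache.hive.service.cli.RowSet;
import org.apache.hive.service.cli.SessionHandle;

// Because our HTTP handlers drive the CLIService directly, we control the
// entire life cycle of a query and can bracket it with a Tephra transaction.
public class ExploreQueryRunner {
  private final CLIService cliService;

  public ExploreQueryRunner(CLIService cliService) {
    this.cliService = cliService;
  }

  public RowSet run(String user, String statement, Map<String, String> conf) throws Exception {
    SessionHandle session = cliService.openSession(user, "", conf);
    try {
      startTransaction();                                    // hypothetical Tephra hook
      OperationHandle op = cliService.executeStatement(session, statement, conf);
      try {
        return cliService.fetchResults(op);                  // streamed back to the HTTP client
      } finally {
        cliService.closeOperation(op);
        commitTransaction();                                 // hypothetical Tephra hook
      }
    } finally {
      cliService.closeSession(session);
    }
  }

  private void startTransaction()  { /* start a Tephra transaction (omitted) */ }
  private void commitTransaction() { /* commit the Tephra transaction (omitted) */ }
}
```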

Hive-Reactor integration challenges

Since the Hive jars bundle disparate libraries like Protocol Buffers and Guava, and Reactor uses different versions of these libraries, we ran into a recurring classloading issue. We had to make sure that the classpaths of the different components that run Hive in Reactor list their jars in a particular order. We fixed the classpath order in various places by doing the following:

  • In Reactor Master: since we only launch Hive service from Reactor Master, it was sufficient to isolate Hive jars in a separate classloader to avoid any library version conflicts.
  • In the YARN container running Hive service: we set the classpath of the container in the right order during its initialization since we have control over it.
  • In the MapReduce jobs launched by Hive on the YARN cluster: we set the Hadoop configuration property mapreduce.job.user.classpath.first to true so that the Reactor jars listed in the hive.aux.jars configuration come before the Hive jars on the classpath of those jobs.
  • In the local MapReduce jobs launched in the same YARN container as the Hive service: we changed the way the Hadoop classpath was set by modifying the HADOOP_CLASSPATH environment variable, and we set HADOOP_USER_CLASSPATH_FIRST to true so that the HADOOP_CLASSPATH content comes before the Hive jars on the classpath of those local jobs (see the snippet after this list).
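
For reference, the settings from the last two items look roughly like this in code (illustrative only; the surrounding Reactor launcher logic is not shown):

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public final class ClasspathOrdering {

  // MapReduce jobs that Hive submits to the cluster: the jars listed in
  // hive.aux.jars (the Reactor jars) are placed ahead of the Hive jars.
  static void preferUserJars(Configuration conf) {
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
  }

  // Local MapReduce jobs running inside the Hive service container: the
  // HADOOP_CLASSPATH content is placed ahead of the Hive jars.
  static void preferHadoopClasspath(Map<String, String> containerEnv, String reactorJars) {
    containerEnv.put("HADOOP_CLASSPATH", reactorJars);
    containerEnv.put("HADOOP_USER_CLASSPATH_FIRST", "true");
  }
}
```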

Future work

We are working hard to enable a more fun and productive user experience for data exploration in Hadoop. One of the features that we plan on introducing in the future is a JDBC driver to connect third-party BI tools to Reactor Datasets. This will make our platform even more accessible to all users who want to work with data in Hadoop.

To try out our latest new features including Ad-hoc SQL queries, download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

And if this work sounds exciting to you, check out our careers page and send us your resume!


Continuuity Reactor 2.3: SQL and Security Release

Jul 23 2014, 10:22 am

Alex Baranau is a software engineer at Continuuity where he is responsible for building and designing software fueling the next generation of Big Data applications. Alex is a contributor to HBase and Flume and has created several open-source projects. He also writes frequently about Big Data technologies.

The Continuuity Reactor platform is designed to make it easy for developers to build and manage data applications on Apache Hadoop™ and Apache HBase™. Every day we’re passionately focused on delivering an awesome experience for all developers, with or without Hadoop expertise. And today, we’re excited to release the next version of our platform, Continuuity Reactor 2.3.

In addition to continued stability, scalability, and performance, we have added a number of significant new features in Continuuity Reactor 2.3:

Ad-hoc SQL Queries

Procedures are an existing, programmatic way to access and query your data in Reactor, but sometimes you may want to explore a Dataset in an ad-hoc manner rather than writing procedural code. Reactor now supports ad-hoc SQL queries over Datasets via a new API that allows developers to expose the schema of a Dataset and make it query-able through a REST API. Queries are submitted via REST, executed by Apache Hive or other Hadoop-based SQL engines, and the results are retrieved via REST as well.
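
As a rough illustration, submitting a query could look like the snippet below. The endpoint path, port, request body, and table name are placeholders rather than the documented REST API, so check the developer documentation for the exact calls:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of submitting an ad-hoc SQL query over REST.
public class SubmitQuery {
  public static void main(String[] args) throws Exception {
    String body = "{\"query\": \"SELECT * FROM my_dataset LIMIT 10\"}";
    URL url = new URL("http://reactor-host:10000/v2/data/queries");  // placeholder URL
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    // A handle for the running query comes back; results are then fetched
    // with follow-up REST calls (not shown).
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
```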

Security Enhancements

We’re committed to making Hadoop applications secure. Continuuity Reactor now supports perimeter security, restricting access to resources only to authenticated users. With perimeter security, access to cluster nodes is restricted by a firewall. Cluster nodes can communicate with each other, but outside clients can only communicate with the cluster through a secured gateway.

Using Reactor security, the Reactor authentication server issues credentials (access tokens) to authenticated clients, and clients then send these credentials with their requests to Reactor. Calls that lack valid access tokens are rejected, limiting access to only authenticated clients. You can learn more about the authentication process on the Reactor Security page.
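
A minimal sketch, assuming the client has already obtained a token from the authentication server (the host, port, path, and Bearer header scheme shown here are assumptions; the Reactor Security page documents the exact format):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class AuthenticatedCall {
  public static void main(String[] args) throws Exception {
    String accessToken = "...";  // issued by the Reactor authentication server
    HttpURLConnection conn =
        (HttpURLConnection) new URL("http://reactor-host:10000/v2/apps").openConnection();  // placeholder URL
    // Attach the access token to the request; calls without a valid token are rejected.
    conn.setRequestProperty("Authorization", "Bearer " + accessToken);
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
```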

Additional Release Highlights

Other key enhancements in 2.3 include new Application, Stream, Flow, and Dataset features such as:

  • Stream support for a data retention policy, reconfigurable at runtime while the Stream is in use
  • Stream support for truncate via REST
  • Simplified Flowlet @Batch support: process methods no longer require an Iterator (see the sketch after this list)
  • New Datasets API that gives more power and flexibility when developing custom Datasets
  • Dataset management outside of Applications exposes REST interfaces to create, truncate, drop and discover Datasets
  • New Application API with an improved way to define application components
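
As an example of the simplified @Batch support mentioned above, a Flowlet process method can now be written for a single event while still being dequeued in batches. This is an approximate sketch; the package names and signatures are our best recollection of the Reactor 2.3 API, so consult the developer documentation for the authoritative version:

```java
import com.continuuity.api.annotation.Batch;
import com.continuuity.api.annotation.ProcessInput;
import com.continuuity.api.flow.flowlet.AbstractFlowlet;

public class WordCounterFlowlet extends AbstractFlowlet {

  // @Batch asks the runtime to dequeue up to 100 events per invocation,
  // but the method is still written for a single event. Previously a
  // batched process method had to accept an Iterator<String>.
  @Batch(100)
  @ProcessInput
  public void process(String line) {
    for (String word : line.split("\\s+")) {
      // ... update a Dataset with the word count (omitted)
    }
  }
}
```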

Finally, we have added Reactor Services, an experimental feature that allows the addition of custom User Services that can be easily discovered from within Flows, Procedures, and MapReduce jobs. We’ll have more service capabilities in our next release, but you can get an early preview of one of the features we are most excited about right now!

Try Reactor 2.3 Today

We are working hard to solve the challenging problems faced by both new and experienced data application developers and to enable a much more fun and productive development experience on Hadoop. Reactor unifies the capabilities you need when developing on Hadoop into an integrated developer experience so that you can focus on your application logic without worrying about distributed systems architecture or scalability. Download the Continuuity Reactor 2.3 SDK and check out the developer documentation to get started.

We are excited about the latest release and would love to hear your thoughts. Please feel free to send us feedback at support@continuuity.com.


Meet Tephra, An Open Source Transaction Engine

Jul 18 2014, 8:00 am

Gary Helmling is a software engineer at Continuuity as well as an Apache HBase committer and Project Management Committee (PMC) member. Prior to Continuuity, Gary led the development of Hadoop and HBase applications at Twitter, TrendMicro, and Meetup.

Our platform, Continuuity Reactor, uses several open source technologies in the Apache Hadoop™ ecosystem to enable any developer to build data applications. One of the major components of our platform is Apache HBase, a non-relational, massively scalable column-oriented database modeled after Google’s BigTable. We use HBase for a number of reasons, including the strong data consistency it provides. One of the limitations of HBase as a standalone system, however, is that data updates are consistent only within a single region, or a set of contiguous rows, because it is very difficult to coordinate updates across these regions in a way that maintains scalability.

As a result, one of the tradeoffs is that HBase maintains consistency for a single row or region of rows, but anything that spans regions or tables cannot be updated atomically (i.e., with the entire transaction committed as one), nor can an atomic update span multiple remote procedure calls (RPCs). While we value what HBase provides, we believe that globally consistent transactions simplify application development a great deal, allowing developers to focus more on the problems and use cases they care about rather than on implementing complex data access patterns.

This is why we built Tephra, a distributed, scalable transaction engine designed for HBase and Hadoop. Tephra can also be extended to integrate with other NoSQL systems like MongoDB and LevelDB as well as traditional relational databases and data warehouses. Tephra is a powerful data management tool that makes a wide range of use cases easier to solve, especially online and OLTP applications. It utilizes the key features of HBase to make transactional capabilities available without sacrificing overall performance.

Today we’re open sourcing Tephra for anyone to use because we believe that the broader developer community can benefit from it, and for anyone to contribute to because we have built Tephra with extensibility in mind.

How can developers use Tephra?

One common use case is secondary indexes. Developers typically create secondary indexes on HBase by writing updates to a second table with additional rows that reference the rows in the main table based on the index values. The problem is that there isn’t consistency in operations across the two tables, so they can get out of sync. Based on their actual data access patterns and what their application cares about, developers are forced to adopt more complicated application logic to manage the data and work around the inconsistencies. In contrast, Tephra simplifies this use case by allowing updates to both tables to be performed in a single globally consistent transaction.
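
A minimal sketch of that pattern with Tephra might look like the following. The package names and the HBase-compat module vary between Tephra releases, and the column families and qualifiers are made up for the example, so treat the imports and details as illustrative:

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import com.continuuity.tephra.TransactionContext;
import com.continuuity.tephra.TransactionSystemClient;
import com.continuuity.tephra.hbase.TransactionAwareHTable;

// The data table and the index table are updated inside one Tephra
// transaction, so readers never see one update without the other.
public class IndexedWrite {

  public void write(TransactionSystemClient txClient, HTable data, HTable index,
                    byte[] row, byte[] indexValue) throws Exception {
    TransactionAwareHTable txData = new TransactionAwareHTable(data);
    TransactionAwareHTable txIndex = new TransactionAwareHTable(index);
    TransactionContext txContext = new TransactionContext(txClient, txData, txIndex);

    txContext.start();
    try {
      // Write the row to the main table...
      txData.put(new Put(row).add(Bytes.toBytes("d"), Bytes.toBytes("v"), indexValue));
      // ...and the reference row to the index table, keyed by the indexed value.
      txIndex.put(new Put(indexValue).add(Bytes.toBytes("i"), Bytes.toBytes("r"), row));
      // Either both writes become visible, or neither does.
      txContext.finish();
    } catch (Exception e) {
      txContext.abort();
      throw e;
    }
  }
}
```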

Why are we open sourcing Tephra?

Many developers and companies are successfully using HBase, but there are still gaps in its accessibility to developers. Tephra takes the strong foundation that HBase has given us to build upon and enhances it by making it more developer-friendly and broadening the potential users and use cases of HBase. We are open sourcing the technology because we want to give back to the community and believe Tephra will be useful to a broad range of developers.

We also are excited to see how others will use, apply, and extend Tephra transactions to their own applications, infrastructures, and environments. We recognize that developers have specific needs, some of which we haven’t anticipated, and we look forward to Tephra growing as a project and community.

Learn more and get involved

Check out the release notes or our slideshare for more details about Tephra. And please help us make the project better by joining our user and developer mailing lists, contributing code, and reporting any issues, bugs, or ideas.


Behind the scenes: Hacking our way to success

Jul 7 2014, 8:47 pm

Sreevatsan Raman is a software engineer at Continuuity where he is building and architecting a platform fueling the next generation of Big Data applications. Prior to Continuuity, Sree designed and implemented big data infrastructure at Klout, Nominum, and Yahoo!

We just wrapped up our latest hackathon and it was a great reminder of the unique engineering culture we have at Continuuity. We have created a new application development platform, Continuuity Reactor, which is focused on allowing developers to quickly and easily build Big Data applications.

Building a platform that no one has created before is a big challenge. We break this huge effort into a continuous cadence of platform releases that are delivered to production frequently. Before every release we take a break from our daily efforts and hack on our platform for 48 hours, stretching our imaginations and the platform capabilities we just built.

Every hackathon gives us an opportunity to dog-food our technology. We come together wearing our developer hats to build features and applications, incorporating our lessons learned into continually improving the developer experience, with the goal of making Hadoop more simple and accessible.

One of my favorite aspects of our hackathons is how the whole company comes together to build cool stuff and have fun. From our CEO to our engineering team to people in non-technical roles, everyone participates. Here are some thoughts and experiences about our company, culture, and hackathons from our awesome engineering interns:

Shu Das, University of Michigan

The unique aspect of Continuuity that I like is that everyone has a clear sense of his or her agenda and responsibilities, so we’re empowered to stay on top of our game. Not only do I have the resources I need and responsiveness from the rest of the team, but also the working environment at Continuuity is lively and enjoyable.

My first project was building an application on Reactor that visualizes data about the test cases we run on our code. This work gave me great insights into what our platform is, how to use it, and how our technology can be used for simplifying Hadoop. I really appreciate the fact that the feature I worked on is used daily, as a component of the development lifecycle, and not left off as a side project.

For the hackathon, I teamed up with Kenneth and Gourav (see below) to build a Reactor application that can be used to aggregate, correlate, and visualize data - for instance, metrics, logs, or any other events. It was amazing to see the application built in a very short amount of time using new core functionalities of the platform and dogfooding the new APIs, runtime, and documentation.

Gourav Khaneja, University of Illinois

The work here is interesting because the problems we’re solving are hard. One of my favorite aspects of Continuuity is the willingness of team members to help each other to work through challenges. For example, even during crunch time, every Continuuity member is willing to stop what he or she is doing to help out a fellow employee. I learn a lot from the team on a daily basis.

When I joined, I was tasked with optimizing resource allocation in YARN using Apache Twill. YARN has a large codebase, and although my previous experience with large code bases was limited, I was able to come up to speed quickly with great mentorship from the team and contributed to a major feature in Twill.

Kenneth Le, University of California, Berkeley

Interns are involved in relevant projects right away. While we receive guidance when needed, the focus of the internship program is more on empowering us to deliver and learning more via open communication about the various projects that other people are working on.

My first project was improving a developer tool that is used to deploy code to clusters. The existing tool took about 30 minutes to build and deploy the entire code base. The newer version, which I rewrote in Python, takes about 6 minutes, thus saving developers a lot of time in their development life-cycle.

Julien Guery, Ecole nationale supérieure des Télécommunications de Bretagne

This is an extremely technical company solving challenging problems. One of the first things I noticed is that the interns get to be part of the core engineering team and are involved in all aspects of the company.

In my first project I learned a lot about Apache Hive and the Reactor platform while working on a feature to bring ad-hoc querying capabilities into our platform. I had great mentors who taught me how to test and debug and gave me insights into the architecture of the systems, and now I can dive right into new projects and teams without fear.

During the hackathon, I used our APIs to build a Python SDK. I wanted to showcase how Python developers can easily write big-data applications using our platform, and my efforts during the hackathon demonstrated how this could be accomplished. The hack was well received, and an updated version of this SDK will be made available in a future release.


Our team is working to solve a difficult problem – making Hadoop a platform upon which data applications can be built by all developers. Whether at our hackathons or at our weekly company-wide demos, we are constantly sharing and collaborating so everyone can understand the impact that they have and the context of how their contributions map to the overall vision and mission of the company.

If you’re interested in learning more about our culture and career opportunities at Continuuity, check out http://continuuity.com/careers.


Continuuity & AT&T Labs to Open Source Real-Time Data Processing Framework

Jun 3 2014, 10:32 am

Nitin Motgi is Co-founder of Continuuity, where he leads engineering. Prior to Continuuity, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E. He previously held senior engineering roles at Altera and FedEx.

Today we announced an exciting collaborative effort with AT&T Labs to integrate Continuuity BigFlow, our distributed framework for building durable, high-throughput, real-time data processing applications, with AT&T’s streaming analytics tool, an extremely fast, low-latency streaming analytics database originally built out of the need to manage AT&T’s network at scale. The outcome of this joint endeavor will be a new project, codenamed jetStream, made available as an Apache-licensed open source project with general availability in the third quarter of 2014.

Why are we combining our technologies?

We decided to bring together the complementary functionality of BigFlow and AT&T’s streaming analytics tool to create a unified real-time framework that combines in-memory stream processing with model-based event processing, including direct integration with a variety of existing data systems like Apache HBase™ and HDFS. By combining AT&T’s low-latency and declarative language support with BigFlow’s durable, high-throughput computing capabilities and procedural language support, jetStream gives developers a new way to ingest and store vast quantities of data, build massively scalable applications, and update those applications in real time as new data is ingested.

Moving to real-time data applications

When you look at the wealth of data being generated and processed, and the opportunity within that data, giving more organizations the ability to make informed, real-time decisions with data is critical. We believe that the next commercial opportunity in big data is moving beyond ad-hoc, batch analysis to a real-time model where applications serve relevant data continuously to business users and consumers.

Open sourcing jetStream and making it available within Continuuity Reactor will enable enterprises and developers to create a wide range of big data analytics and streaming applications that address a broad set of business use cases. Examples of these include network intrusion detection and network analytics, real-time analysis for spam filtering, social media market analysis, location analytics, and real-time recommendation engines that match relevant content to the right users at the right time.

New developer features

By using jetStream, developers will gain the following:

  • Direct integration of real-time data ingestion and processing applications with Hadoop and HBase and utilization of YARN for deployment and resource management

  • Framework-level correctness, fault tolerance guarantees, and application logic scalability that reduces friction, errors, and bugs during development

  • A transaction engine that provides delivery, isolation and consistency guarantees that enable exactly-once processing semantics

  • Scalability without increased operational cost of building and maintaining applications

  • The ability to develop pipelines that combine in-memory continuous query semantics with persistent, procedural event processing, using simple Java APIs

For more information, please visit jetStream.io.


HBaseCon: Moving Beyond the Core to Address Availability & Usability

May 19 2014, 12:58 pm

Jonathan Gray, CEO & Co-founder of Continuuity, is an entrepreneur and software engineer with a background in open source and data. Prior to Continuuity, he was at Facebook working on projects like Facebook Messages. At the startup Streamy, Jonathan was an early adopter of Hadoop and an HBase committer.

We just wrapped HBaseCon 2014, the annual event for Apache HBase™ contributors, developers, and users. As in years past, this is one of the most technical conferences that we attend, and it’s really focused on the core community of developers who are doing something meaningful with the enabling technology. What makes HBaseCon so compelling is that it’s not theoretical but rather all about overcoming real technical challenges and actual business use cases. And this year, we noticed a couple of key trends that are shaping the future of HBase.

Overall, we noticed that the HBase discussion has moved up a level, and this is a good thing. We’re no longer talking about the core architecture of HBase, which is pretty much set at this point. People aren’t talking about redoing the architecture; instead, it’s all about building on top of what’s already there. Last year was very focused on improvements to the core platform, such as detecting and recovering from server failures more quickly, and on describing new use cases launching on HBase. But in the year since, HBase has further stabilized into a mature platform, and those new use cases are now established production systems. Now the conversation is about building above and around HBase for higher availability and usability.

There was a lot of good discussion of increasing availability from an HBase standpoint. In the Facebook keynote on HydraBase, they discussed using a consensus protocol for HBase reads and writes in order to tolerate individual server failures without sacrificing availability or strong consistency. Similarly, Hortonworks and others shared work they’ve been doing on timeline consistent read replicas. For example, if a single server goes down you can still read data consistently up to a given point in time—the most updated snapshot of the data. Google’s Bigtable team also touched on availability by addressing their approach to the long tail of latency.

Multiple approaches to availability are happening, but they ultimately lead to the same goals of trying to reduce the big latency outliers and getting to 5-9s (i.e., 99.999%) reliability. In addition to early adopters like Facebook, Cloudera, and Hortonworks, we’re also encouraged to see a lot of other real users step up and take an active role in the community—we’ve seen this particularly in contributions from Salesforce, Xiaomi, and Bloomberg.

All of these companies are using HBase at very large scale, contributing to its development to continue to move it forward, and then sharing their successes with others. For us at Continuuity, HBase usability is what we’re driving at, and we’ll remain very focused on improving usability so that more developers can build their own HBase and Hadoop applications. This is where HBase is going, and we’re excited to be a part of this community and contribute to its success.
