Hadoop Summit: Where is the value? Where are the apps?

Jun 24 2014, 8:00 am

Jonathan Gray, Founder & CEO of Continuuity, is an entrepreneur and software engineer with a background in open source and data. Prior to Continuuity, he was at Facebook working on projects like Facebook Messages. At the startup Streamy, Jonathan was an early adopter of Hadoop and an HBase committer.

Coming out of Hadoop Summit, one thing is clear to me – while the ecosystem has seen significant growth and success, it is still early days and Hadoop remains exceptionally hard for most organizations to consume. Perhaps as a result of this persistent issue, there weren’t many major announcements, nothing exceptionally new or different was released, and the buzz remained largely centered on YARN and Spark, both of which are several years old.

While we saw reports of early-adopting companies creating real value with Hadoop, the focus was more technical this year than I anticipated. From the keynotes to the breakout sessions to the show floor, this year’s summit seemed more about the endless variety of technologies than about use cases and actual return on investment. Below is a brief overview of a few other trends we observed:

Hadoop is not quite enterprise ready…yet

Hadoop Summit generated significant discussion about whether Hadoop is truly ready for real, production enterprise use. Of particular concern are security and the related issues of privacy and the data policies companies need, especially those handling customer or financial information. Recent acquisitions of Hadoop security upstarts by the major Hadoop distributions indicate that this will continue to be an important area of focus in the near term.

Hadoop vs. The EDW: To Replace or To Augment

Another hot topic was whether Hadoop is a replacement for the traditional enterprise data warehouse (EDW) or only a way to augment it and offload certain workloads. In years past this was much more of a debate; this year, however, it seems clear that most have accepted a symbiotic relationship for the time being. While I do expect this to change, there is evidently still a significant gap between the capabilities of the Hadoop stack and proprietary EDW technologies.

Hadoop is becoming more fragmented

This year it became apparent that the Hadoop ecosystem is splintering into multiple, often competing projects. Competing vendors are establishing parallel but increasingly separate stacks, while differentiated vendors market overlapping messages. There has been an explosion in the variety of ways to work with Hadoop and in the number of companies trying to make it consumable, and choosing which path to follow is becoming even more confusing. This is true not only for business leaders making decisions about Big Data projects in their companies but also for knowledgeable developers.

Hadoop (still) needs to be simplified

This mass confusion in the market is undercutting companies’ ability to realize value from their Big Data initiatives. A lot of attention is still being paid to the infrastructure rather than the applications, so although the disruptive value of Big Data should be at the forefront, it remains elusive for most.

The Big Data Application revolution is still forthcoming. It is still early days, Hadoop is still very difficult, and very few people understand how to work with it. That’s why we are building a platform that focuses on making Hadoop easier for developers, allowing anyone to build applications (today in Java) without worrying about the low-level infrastructure. Rather than grapple with myriad technology options, they are free to focus on what matters – turning their brilliant ideas for data into apps that solve real problems. This is where Hadoop can produce desired outcomes – in data applications that quickly provide measurable value.

Adding Jet Fuel to the Fire

Not to be left out of the new choices in the Hadoop menagerie: in case you missed it, we announced a project in collaboration with AT&T Labs, a distributed framework for real-time data processing and analytics applications, codenamed jetStream. It will be available as open source in Q3 2014; you can find more information about this effort in our recent blog post and at jetStream.io.

Continuuity & AT&T Labs to Open Source Real-Time Data Processing Framework

Jun 3 2014, 10:32 am

Nitin Motgi is Co-founder of Continuuity, where he leads engineering. Prior to Continuuity, Nitin was at Yahoo! working on a large-scale content optimization system externally known as C.O.R.E. He previously held senior engineering roles at Altera and FedEx.

Today we announced an exciting collaboration with AT&T Labs to integrate Continuuity BigFlow, our distributed framework for building durable, high-throughput real-time data processing applications, with AT&T’s streaming analytics tool, an extremely fast, low-latency streaming analytics database originally built out of the necessity of managing AT&T’s network at scale. The result of this joint endeavor will be a new project, codenamed jetStream, released to the market as an Apache-licensed open source project with general availability in the third quarter of 2014.

Why are we combining our technologies?

We decided to bring together the complementary functionality of BigFlow and AT&T’s streaming analytics tool to create a unified real-time framework that combines in-memory stream processing with model-based event processing, including direct integration with a variety of existing data systems such as Apache HBase™ and HDFS. By combining AT&T’s low-latency processing and declarative language support with BigFlow’s durable, high-throughput computing capabilities and procedural language support, jetStream gives developers a new way to ingest and store vast quantities of data, build massively scalable applications, and update those applications in real time as new data arrives.
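
To make that combination concrete, here is a minimal, self-contained Java sketch – the class and method names are hypothetical, not the actual BigFlow or jetStream API – of a pipeline that pairs a declarative, in-memory filter over a stream of events with a procedural handler that records its results in a stateful store:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a declarative in-memory filter feeding a
// procedural, stateful event handler. In jetStream the handler's
// state would live in a durable store such as HBase; here a plain
// in-memory list stands in for it.
public class PipelineSketch {

    // Procedural side: a handler that persists every event it sees.
    static final List<String> store = new ArrayList<>();

    static void handle(String event) {
        store.add(event); // stand-in for a durable write
    }

    public static void main(String[] args) {
        List<String> incoming = List.of(
            "login:alice", "error:disk", "login:bob", "error:net");

        // Declarative side: a continuous-query-style filter expressed
        // over the in-memory stream of incoming events.
        incoming.stream()
                .filter(e -> e.startsWith("error:"))
                .forEach(PipelineSketch::handle);

        System.out.println(store); // [error:disk, error:net]
    }
}
```

The point of the split is that the declarative half can run entirely in memory at low latency, while the procedural half owns durability, mirroring the division of labor between AT&T’s tool and BigFlow described above.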

Moving to real-time data applications

Given the wealth of data being generated and processed, and the opportunity within that data, giving more organizations the ability to make informed, real-time decisions is critical. We believe that the next commercial opportunity in big data is moving beyond ad-hoc, batch analysis to a real-time model where applications serve relevant data continuously to business users and consumers.

Open sourcing jetStream and making it available within Continuuity Reactor will enable enterprises and developers to create a wide range of big data analytics and streaming applications that address a broad set of business use cases. Examples of these include network intrusion detection and network analytics, real-time analysis for spam filtering, social media market analysis, location analytics, and real-time recommendation engines that match relevant content to the right users at the right time.

New developer features

jetStream will give developers the following capabilities:

  • Direct integration of real-time data ingestion and processing applications with Hadoop and HBase, using YARN for deployment and resource management

  • Framework-level correctness, fault-tolerance guarantees, and application-logic scalability that reduce friction, errors, and bugs during development

  • A transaction engine that provides delivery, isolation, and consistency guarantees, enabling exactly-once processing semantics

  • Scalability without increasing the operational cost of building and maintaining applications

  • Pipelines that combine in-memory continuous-query semantics with persistent, procedural event processing, built with simple Java APIs
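
The exactly-once guarantee above can be illustrated with a small, self-contained sketch – hypothetical, not the jetStream transaction engine – in which each message carries an ID, and checking and recording that ID in one atomic step ensures a redelivered message is never applied twice:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of exactly-once semantics via deduplication:
// messages may be delivered more than once (at-least-once delivery),
// but checking and recording the message ID in the same atomic step
// ensures each message's effect is applied exactly once. A real
// transaction engine would persist this state durably; here it is
// held in memory.
public class ExactlyOnceSketch {

    private final Set<String> processedIds = new HashSet<>();
    private int total = 0; // the "effect": a running sum

    // Returns true if the message was applied, false if it was a duplicate.
    public synchronized boolean process(String messageId, int value) {
        if (!processedIds.add(messageId)) {
            return false; // already applied; ignore the redelivery
        }
        total += value;   // apply the effect exactly once
        return true;
    }

    public synchronized int total() {
        return total;
    }

    public static void main(String[] args) {
        ExactlyOnceSketch sketch = new ExactlyOnceSketch();
        sketch.process("m1", 10);
        sketch.process("m2", 5);
        sketch.process("m1", 10); // redelivery of m1 is dropped
        System.out.println(sketch.total()); // 15
    }
}
```

Because the duplicate check and the state update happen under one lock, a crash-and-retry or redelivery cannot double-count a message, which is the essence of the delivery, isolation, and consistency guarantees listed above.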

For more information, please visit jetStream.io.
