In my last blog post, The Business Justification for Big Data, I explored common business use cases that drive companies to start Big Data projects. In this post I cover why Hadoop is disruptive, what Hadoop is, and which software technology companies stand to lose or gain the most as Hadoop gains market share in the enterprise.

Why is Hadoop important?

Big Data is often synonymous with Hadoop, an open source Apache project born out of work at Google (the Google File System, MapReduce, and BigTable, which became HBase) and Facebook (Hive), with later significant contributions from Yahoo (Pig) and numerous others. While it isn’t the only Big Data solution on the market, it has become so popular because it is:

  • Flexible
  • Scalable
  • Low cost
  • Widely adopted and contributed to by leading tech companies – eBay, LinkedIn, Facebook, Yahoo, Google, Netflix

Flexible – Hadoop provides flexibility by allowing you to analyze large quantities of both structured and unstructured data in a single repository. Monolithic systems like Oracle can be seen as fragile, requiring downtime for schema changes or to add capacity. In the world of Big Data, the assumption is that you will have to deal with many different types of data, and quickly. A master data management project that takes two years to standardize the format of incoming data isn’t going to cut it. Those projects dictate how all data must be transformed into a standard format so that Extract, Transform, and Load (ETL) processes can move it from siloed data sources into the data warehouse. Hadoop lets you add new data structures without the overhead of costly up-front transformations before you load the data. The ‘T’, transformation, is considered the most expensive part of any data warehousing activity. With Hadoop, the transformation happens as part of producing the outcome, not as part of loading the data, and a new design pattern of ELT – extract, load, then transform as often as needed – becomes possible. This allows companies and data scientists to look at the data in new ways. In fact, we often see companies employ Hadoop as a superior transformation engine to populate their existing data warehouses.
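To make the ELT idea concrete, here is a minimal Python sketch of my own (the records and field names are invented for illustration): the raw data is stored exactly as it arrived, and each analysis imposes its own structure at read time instead of forcing one schema up front.

    # A minimal sketch of ELT / schema-on-read. Records and fields are illustrative only.
    raw_records = [
        "2013-11-04,US,checkout,19.99",
        "2013-11-04,DE,search,",       # a newer source with a missing field is loaded as-is
    ]

    # Each analysis applies its own "T" when the data is read, not when it is loaded.
    def revenue_by_country(line):
        date, country, event, amount = line.split(",")
        return (country, float(amount or 0))

    def events_by_type(line):
        date, country, event, _ = line.split(",")
        return (event, 1)

    print([revenue_by_country(r) for r in raw_records])  # [('US', 19.99), ('DE', 0.0)]
    print([events_by_type(r) for r in raw_records])      # [('checkout', 1), ('search', 1)]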

Scalable – Hadoop is built to scale. It is an elastic solution that allows you to add nodes for additional processing and storage without bringing the system down. The volume of data that companies are collecting is pushing the limits of traditional data warehouses. IT teams spend a lot of time getting rid of data just to stay within their data warehouse’s storage limits, and considerable additional time archiving it. As a result, the potential for analytics becomes limited.

Low cost – Hadoop is a very disruptive technology because it satisfies both emerging and existing needs while greatly lowering the cost of Big Data storage, transformation, and analytics. Storage capacity is increased simply by adding a node to the system. And perhaps most important of all, Hadoop is open source software designed to run on low-cost commodity hardware; it ensures fault tolerance while running on machines that are expected to fail.

Finally, one cannot ignore the amount of open source contribution the largest tech companies are making to Hadoop. Some of the smartest minds in the industry are contributing to this project, and they work for the most powerful technology companies today.

What is Hadoop?

Hadoop is a general purpose, fault tolerant, scalable data storage and processing engine. At its core, it is a file system, a processing engine, and an ecosystem of tools. The processing engine began as a batch engine and is growing into a real-time, streaming engine. The Hadoop Distributed File System (HDFS) provides fault-tolerant, high-throughput, replicated storage that can be processed in parallel, regardless of the data’s structure. Hadoop also has a distributed batch processing engine for Map and Reduce tasks: in Hadoop 1.x this is MapReduce, while Hadoop 2.x introduces YARN, a more general resource manager that MapReduce (and other engines) run on. The ecosystem of tools supports different data processing and manipulation scenarios.

In the Hadoop v1.x implementation, Map tasks do the transformation work for you, and Reduce tasks reduce the data into a result set, e.g. a sum, an average, or a top rank. Many of the tools, like Pig and Hive, abstract away these lower levels, simplifying data loading and querying; under the hood, both Pig and Hive queries turn into MapReduce jobs.
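As a concrete illustration (my own sketch, not taken from any particular distribution), here is a classic word count written for Hadoop Streaming in Python: the mapper performs the per-record transformation, and the reducer collapses the sorted key/value pairs into a result set.

    # mapper.py -- the Map step: transform each input line into (word, 1) pairs.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

    # reducer.py -- the Reduce step: Hadoop delivers the pairs sorted by key,
    # so we can sum the counts for each word as they stream past.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

In practice you would hand these two scripts to the Streaming tool listed below, while Pig or Hive would generate equivalent MapReduce jobs for you from a few lines of script or SQL-like query.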

When data is introduced into the Hadoop file system (HDFS), it is transparently distributed and replicated across the nodes of the Hadoop cluster for processing. Processing is parallelized, and when it is complete the resulting data is written back to HDFS for use by the originator. Finally, Hadoop provides an ecosystem of tools, described below.

What makes Hadoop different from a traditional data warehouse?

Ironically, a data warehouse is a database designed to be fault-tolerant, scalable data storage and processing technology, much like Hadoop. So who cares, and why the buzz?

The fundamental difference is that Hadoop is general purpose, meaning it can store any type of data; in fact, it doesn’t care about the data’s structure. A data warehouse is designed for well-structured data. It implies strong standards across your organization and rigidity in how you collect, store, and analyze data. That rigidity generally means a lot of project management, and a system that cannot evolve quickly, which can hinder innovation.

Typical data warehouse operations deal with extremely large amounts of data. It is not unusual for a large organization to load terabytes of data nightly, and data warehouses can grow to petabytes or even exabytes. The challenge is the amount of time required to perform ETL and business intelligence (BI) queries at that scale.

In addition, many data warehouses have a scale limit, e.g. a few petabytes of storage. While a petabyte is an enormous amount of data, it barely scratches the surface of what is needed. To deal with this limitation, customers spend a lot of time managing the size of the data within the data warehouse and dumping data before putting it in. In the last year, Amazon released its petabyte-scale data warehouse, Redshift, which changes the playing field on data warehouse scale. But then there is always the fear of putting the heart of your business, its data, in the cloud.

With Hadoop, there is no need to archive. The economics of the system, whether it is configured on commodity hardware in your data center or hosted in the cloud, are favorable enough to keep the data. So rather than losing this data, enterprises are leveraging Hadoop as their data reservoir. Many enterprises have also begun to use Hadoop within their traditional data warehouse environments to offload their ETL processes: Hadoop consolidates the data and then populates the data warehouse. Some thought leaders at companies like eBay are starting to question whether you need a separate data warehouse at all, and have begun moving data into Hive as a proof of concept to see if it can replace their traditional data warehouse. A recent drop in Teradata’s stock price was attributed to this potential market disruption.

On the flip side, Hadoop currently lacks robust support for ad hoc reads and writes and for ACID-based transactions, although the Apache HBase and Hive projects are working to address this. The technology in this space is emerging quickly, however, and the gap is expected to be filled within the next year or two.

Hadoop v2.2, the GA version of Hadoop v2, was released on 10/15/2013. The new platform shows maturity in many areas of the Hadoop ecosystem, but in general the tools in this space are still emerging. For example, there is an ongoing need for a real-time processing engine; no data warehouse on the market that I know of today is considered sufficient for real-time data. New technologies like Storm are being built on top of Hadoop to provide more real-time capabilities, and currently most companies I have seen choose solutions like Cassandra for real-time processing. As these technologies mature, the differences between Hadoop and a traditional data warehouse will only widen.

In the meantime, there will be coexistence. A lot of your mission-critical data will stay in the traditional data warehouse for some time. But as your data scientists and business leaders start to ask new and more demanding questions of your data, Hadoop and other Big Data solutions will become increasingly important.

The Hadoop Tools Ecosystem

Hadoop v1.x and v2.x are similar in terms of the tools ecosystem, and some of the touch points are essentially the same; however, v2.x fundamentally shifts the architecture from a batch processing engine to a general processing engine with the introduction of YARN and Tez. Some of the command line interfaces in v2.x have been updated, and the deprecated APIs are still supported but emit a friendly warning.

 

[Image: the Hadoop tool ecosystem in v1.x]

[Image: Hadoop becomes a general data processing platform in v2]

Most common Hadoop Tools

  • Streaming – Used to stream live or batch data to custom MapReduce applications written in whatever language you like: Java, Ruby, Python, or C++.
  • Pig – A programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and calculating the final results. Pig has become the most popular tool for data transformation and queries. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
  • Hive – Enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over that data using a familiar SQL-like syntax.
  • HBase – Provides columnar storage, a clustered 'NoSQL-like' database on top of HDFS. HDFS does not provide random read and write access; HBase fills that gap.
  • Sqoop – Easily pulls data from a relational database into Hadoop and pushes it back out again.
  • Flume – A log collection tool based on traditional asynchronous queuing design patterns for reliable, scalable streaming.
  • Kafka – Similar to Flume; however, it trades some of that reliability for higher throughput.

Within the Hadoop tool ecosystem, there are two very lightweight scheduling solutions: Oozie, an Apache project, and Azkaban, LinkedIn's homegrown tool that was recently open sourced. Neither of these solutions is enterprise grade yet.

Disclosure: I work for Automic Software and we provide an end-to-end, enterprise class automation solution for Big Data.

Getting up and running with Hadoop quickly, in the cloud or on premises

Companies looking for a supported, enterprise-grade, on-premises solution turn to vendors like Cloudera, Hortonworks, or MapR, where they can download single-instance VMs that give them a fully functional, small-scale environment. However, many companies do not want to handle the IT operations of a Hadoop cluster themselves, because they want to do one or more of three things:

  • Delay training of IT Staff
  • Get up and running quickly
  • Leverage the cloud for elasticity

As with many projects, the data scientist often needs to quickly spin up an interactive session with a Hadoop instance to play with a new concept. They don't have the time to set up a Hadoop cluster, nor do they have internal IT resources familiar with Hadoop or the time to wait for IT to come up to speed. And when they need to scale the solution for larger data processing, they want to be able to add nodes without provisioning new machines. As a result, they begin their Big Data efforts using a cloud Platform as a Service (PaaS) offering for Hadoop, such as Amazon Elastic MapReduce. These offerings allow you to spin up a pre-configured Hadoop cluster within minutes.

Amazon's Hadoop service for processing big data is called Elastic MapReduce (EMR). EMR provides a cluster of virtual EC2 instances backed by files on S3, along with a management console and cluster monitoring through CloudWatch. CloudWatch is a generic dashboard framework; within the EMR service, it reports on the status of the EMR cluster.

Amazon EMR provisions EC2 virtual machines in Hadoop master or slave configurations to store and process data. S3 acts as the source of truth, holding the data inputs, the map and reduce functions, the result sets / outputs, and, when requested, the log files from the service. When working with Amazon's cloud, you can call the corresponding web services to spin up EMR instances, add more slave nodes to speed up processing, and manage the S3 storage layer.
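For illustration, here is a minimal sketch of what calling those web services can look like, written with the boto3 Python SDK (my own example, not from the post; the cluster name, instance types, EMR release label, IAM roles, and S3 log path are placeholders you would replace with your own):

    # A minimal sketch: provision a small EMR cluster programmatically.
    # Assumes AWS credentials are already configured for boto3.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="demo-hadoop-cluster",              # placeholder name
        ReleaseLabel="emr-5.36.0",               # EMR release to provision
        LogUri="s3://my-bucket/emr-logs/",       # logs are written back to S3
        Instances={
            "MasterInstanceType": "m5.xlarge",   # one master node
            "SlaveInstanceType": "m5.xlarge",    # core/slave nodes
            "InstanceCount": 3,                  # grow this to add processing power
            "KeepJobFlowAliveWhenNoSteps": True, # keep the cluster up for interactive work
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

    print("Started cluster:", response["JobFlowId"])

Resizing the cluster and submitting processing steps go through similar API calls, so the whole lifecycle can be scripted without anyone logging into a console.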

In a nutshell, this means you don't need IT staff who understand Hadoop to begin working with it. And later, once the concepts behind Hadoop are more familiar, bringing Hadoop in house will not seem so daunting.

Data Volume and Variety

IDC predicts that the big data technology and services market will grow worldwide from $3.2 billion in 2010 to $16.9 billion in 2015. That is a compound annual growth rate of 40%, about seven times that of the overall information and communications technology market. There is no doubt that the wave of Big Data is coming, and it is going to have an enormous impact on IT.

When thinking about HR, ERP, and CRM systems, we clearly understand how those data silos are standardized through master data management tools and loaded into a traditional data warehouse. But as the data becomes more abundant and more diverse in its structure, problems arise.

Data variety and complexity

Big Data is a by-product of mobility, embedded devices like sensors, social media, and new technologies that allow us to use images, videos, and more as sources of data. All of these new channels come with new types of data structures, and with variety comes complexity. Imagine trying to run a master data management project to standardize database tables for 20 new data sources. How long would that take?

Companies today cannot afford to take two years to leverage these new sources of data. They need the agility and time-to-value to ask new types of questions and gain new insights that give them a competitive edge. In walks Big Data.

What is unstructured data?

Data generally falls into three categories: structured, semi-structured, and unstructured. You are already familiar with structured data, i.e. the data in your relational data stores and data warehouses.

Take, for example, the following database entry:

[Image: a database table entry for Cindy]

Here you have some well-structured information: the database table expects each field of the entry to be in a particular format.

Now consider a picture of Cindy. How do you handle media data, such as a photograph?

[Image: a photograph of Cindy]

 

This type of data is commonly used today with facial recognition software. By processing this piece of data, you might determine with 75% probability that this is Cindy, and with 65% probability that she is focused and working.

 

This is considered unstructured data. Facial recognition may be a bit far out there for most people, so let's look at another, more familiar example of unstructured data.

Here is a line from a Web server log file:

Wed Dec 4 00:00:00 2002, fe80::14c9:b3c4:53ec:4825%12, 173.194.33.110   GET /songtitle hipsterbeware

The file doesn't give you any metadata to interpret it. You have to figure out for yourself that the first entry is a date, the second an IPv6 address, the third an IPv4 address, and the fourth the activity the user is performing on the website.
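To make that concrete, here is a minimal Python sketch of imposing that structure by hand (the field names are my own, chosen to mirror the tags in the semi-structured version below):

    # Parsing the raw, unstructured log line: all of the "meaning" lives in our code.
    raw = "Wed Dec 4 00:00:00 2002, fe80::14c9:b3c4:53ec:4825%12, 173.194.33.110   GET /songtitle hipsterbeware"

    timestamp, ipv6, rest = [field.strip() for field in raw.split(",", 2)]
    ipv4, click = rest.split(None, 1)   # first token is the IPv4 address, the rest is the click

    record = {"time": timestamp, "ipv6address": ipv6, "ipaddress": ipv4, "click": click}
    print(record)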

Semi-structured data is data that is not in a database but has enough metadata or structure to convey its meaning. Take the log file example from before and make it semi-structured:

<time>Wed Dec 4 00:00:00 2002</time>

<ipv6address>fe80::14c9:b3c4:53ec:4825%12</ipv6address>

<ipaddress>173.194.33.110</ipaddress>

<click>GET /songtitle hipsterbeware</click>

Without any explanation, most people can quickly understand the context of this data.
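And because the tags now carry the meaning, generic tooling can pull the fields out without any custom parsing rules. A minimal sketch (assuming the tagged fields are wrapped in a single root element, which the fragment above omits):

    # Parsing the semi-structured record: the structure travels with the data.
    import xml.etree.ElementTree as ET

    record_xml = """<record>
      <time>Wed Dec 4 00:00:00 2002</time>
      <ipv6address>fe80::14c9:b3c4:53ec:4825%12</ipv6address>
      <ipaddress>173.194.33.110</ipaddress>
      <click>GET /songtitle hipsterbeware</click>
    </record>"""

    record = ET.fromstring(record_xml)
    print(record.findtext("time"))       # Wed Dec 4 00:00:00 2002
    print(record.findtext("ipaddress"))  # 173.194.33.110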

 

Who stands to lose as Hadoop gains a foothold in the Enterprise?

At the end of the data (pun intended), innovation within the enterprise and beyond is moving at a pace where it is no longer realistic or feasible to enforce rigid data standards. No smart CIO is going to allow their company to keep throwing away data, or to lose information through numerous lossy transformations. They will demand new types of data orchestration and processing that embrace new data sources, move large volumes of data efficiently without taking down other critical workloads in the environment, and support the plethora of different data structures quickly.

If my understanding of Hadoop is correct, I predict that Hadoop will fundamentally change the landscape of data processing for both existing systems and new types of data processing solutions.

The tools world without Hadoop

[Image: the tools landscape before Hadoop]

 

The world coexisting with Hadoop. In the following image, Hadoop augments the traditional data warehouse: some data transformation for traditional ETL processes is offloaded to Hadoop, and new types of analytics occur in Hadoop.

[Image: the tools landscape coexisting with Hadoop]

 

And finally, a plausible and radical view of the world: Hadoop becomes the data warehouse. New solutions like Storm build on Hadoop and augment it to provide real-time processing, and the need for traditional ETL tools goes away completely. Winners: enterprise and PaaS providers of Hadoop, Big Data orchestration engines, and BI tools that provide visualization on top of Hadoop. Losers: master data management solutions, data transformation engines, and traditional data warehouses.

[Image: Hadoop disrupting the traditional data warehouse landscape]

 

The biggest losers will be companies that dismiss Hadoop as just another tool. The ones that clearly see the disruption, and most do, will figure out how to embrace it. There is a reason companies like Cloudera are praised as hot new startups while well-established companies like SAP are trying to tell a cooperative story with new platforms like SAP HANA.

The reality is, I don't believe technology ever really goes away. Legacy generally goes on living and becomes more expensive to maintain over time. But I recently listened to a case study in which some folks from Sears IT said they were able to retire some COBOL code thanks to the wonders of Hadoop. Not bad.

So in the enterprise, all of these things will likely coexist for some time. But I foresee a world where technology-forward companies emerge that never use these legacy solutions. And at this point, I wouldn't buy stock in any data-related company that doesn't embrace Hadoop or other Big Data solutions.

Shameless plug: needless to say, across all three of these pictures you still need a data orchestration engine and a data pipeline to collect the data, push it to the cluster, trigger the analytic activities, perform IT housekeeping, and the like. And for things that are not real-time, you need scheduling and service level agreements. There will always be a play for the data movers.