Six Stages of Big Data Operations
In the most recent “Gartner Hype Cycle for Data Management 2018” report, the concept of DataOps appeared for the first time, in the “Innovation Trigger” phase, with an expected 5 to 10 years to reach the plateau.
What does this mean? In the hype cycle, a technology is expected to pass through five stages of public expectations:
1. appear in the market -> innovation trigger
2. get hyped with unrealistic expectations -> peak of inflated expectations
3. get more realistic about its real value -> trough of disillusionment
4. get to mainstream and to be known by more people -> slope of enlightenment
5. mature in the mainstream market -> plateau of productivity
So the Hype Cycle report basically says that DataOps has just appeared on the horizon of the data management field and is recognized as a potentially market-disrupting technology, just like Spark and stream processing a few years ago. But what does DataOps really mean? Why did it appear only almost 10 years after Hadoop led the wave of the big data movement?
We will try to answer these questions by describing the six stages of big data projects and seeing what DataOps really brings to the table.
Stage 1: Playground
In this stage, your team may install a Hadoop cluster with Hive, probably with Sqoop, to get some data onto the cluster and run some queries. In recent years, the lineup probably would include Kafka and Spark. If you want to do some log analysis, you may install the ELK (ElasticSearch, LogStash, Kibana) stack as well.
Remember that most of these systems are complex distributed systems, and some of them need database support. Although many provide a single-node mode for you to play with, your team still needs to be familiar with common DevOps tools like Ansible, Puppet, Chef, or Fabric.
Thanks to the hard work of the open source community, playing with these tools and prototyping should be viable for most engineering teams. With a few good engineers, you can probably get the systems running in a few weeks, depending on how many components you want to install.
Stage 2: Automation
Now that you have a big data system, what’s next? You may have:
● some Hive queries to run periodically, say once an hour or once a day, to generate business intelligence reports;
● some Spark machine learning programs that generate user analysis models so that your product can provide personalized services;
● a crawler that needs to pull data from a remote site from time to time;
● or some stream processing programs that feed a real-time dashboard shown on a big screen.
So you need a job scheduling system to run them on a schedule or based on data availability. Here come Oozie, Azkaban, and Airflow. These workflow systems let you specify when to run your programs (like cron on Linux machines).
The features of workflow systems vary a lot. For example, some provide dependency management, which lets you specify scheduling logic such as “job A runs only when jobs B and C finish.” Some can manage only Hadoop programs, while others support more types of workflow. You have to pick the one that fits your requirements best.
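The dependency-management idea can be sketched as a topological sort over the job graph. This is a simplified illustration of the concept, not how Airflow or Oozie actually implement scheduling:

```python
from collections import deque

def topological_order(deps):
    """Return a run order where every job runs after its dependencies.

    deps maps a job name to the set of jobs it depends on.
    Raises ValueError when the graph contains a dependency cycle.
    """
    remaining = {job: set(d) for job, d in deps.items()}
    # Jobs with no dependencies are ready to run immediately.
    ready = deque(job for job, d in remaining.items() if not d)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Finishing `job` may unblock other jobs.
        for other, d in remaining.items():
            if job in d:
                d.remove(job)
                if not d:
                    ready.append(other)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# Job A only runs when jobs B and C have finished.
print(topological_order({"A": {"B", "C"}, "B": set(), "C": set()}))
# → ['B', 'C', 'A']
```

A real scheduler layers time triggers, retries, and data-availability checks on top of this ordering, but the dependency graph is the core abstraction.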
Besides the workflow system, other tasks also need to be automated. For example, some data on your HDFS needs to be removed after some time: if you can keep only one year of transactions, then on the 366th day you need to remove the oldest day’s data from the data set. This is called a data retention policy. You need a program that specifies and enforces the retention policy for every data source, otherwise your storage will soon run out.
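The retention check itself is simple; the work is in wiring it to every data source. Here is a minimal sketch (the helper name is illustrative; a real implementation would delete the corresponding HDFS paths rather than just list dates):

```python
import datetime

def expired_partitions(partitions, retention_days, today):
    """Given daily partition dates, return those older than the retention window."""
    cutoff = today - datetime.timedelta(days=retention_days)
    return [d for d in partitions if d < cutoff]

# Simulate 370 daily partitions with a one-year retention policy.
today = datetime.date(2019, 1, 1)
partitions = [today - datetime.timedelta(days=n) for n in range(370)]
old = expired_partitions(partitions, 365, today)
print(len(old))  # → 4 partitions fall outside the one-year window
```

In production this would run daily from the same scheduler as the rest of the pipeline, so retention is enforced automatically per data source.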
Stage 3: Production
Now you have an automated data pipeline and the data finally flows! But life in the real world is tough:
● the failure rate of first year hard drives is 5.1% (similar for first year server failure rate)
● the failure rate of 4th year servers is 11%
● open source programs have a lot of bugs
● your own programs have bugs too
● outside data sources have delays
● databases have down times
● network has errors
● people run “sudo rm -rf / usr/local/” with extra spaces
These problems (and more) happen more often than you planned for. If you have 50 machines with 8 hard drives each, you can expect about 20 hard drive failures a year, roughly 2 a month. After several months of struggling with manual processes, you finally realize that you need:
● monitoring: you need a monitoring program to watch the hardware, the operating system, resource usage, and the programs;
● observability: the systems need to tell you how they are doing so that they can be monitored;
● alerting: when something goes wrong, the operations engineer needs to be notified;
● SPOF: single point of failure; if you don’t want to be called at 3 a.m., you’d better get rid of every SPOF;
● backup: you need your important data backed up as soon as possible; don’t rely on Hadoop’s 3 copies of your data, they are easily removed by some extra spaces;
● recovery: if you don’t want to handle every error manually each time it happens, you’d want the system to recover automatically as much as possible.
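As a flavor of the monitoring and alerting items above, here is a minimal disk-usage check using only Python’s standard library. The `alert` hook is a stand-in; in practice it would page or email the on-call engineer:

```python
import shutil

def check_disk(path, threshold=0.9, alert=print):
    """Trigger an alert when disk usage on `path` exceeds the threshold."""
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total
    if used_ratio > threshold:
        alert(f"ALERT: {path} is {used_ratio:.0%} full")
    return used_ratio

# threshold=0.0 forces the alert for demonstration purposes.
ratio = check_disk("/", threshold=0.0)
```

Real deployments use dedicated systems (Nagios, Prometheus, and the like) for this, but the shape is the same: measure, compare to a threshold, notify.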
OK, at this stage you realize that making the system production-ready is not as easy as installing a few open source programs. We are getting a bit more serious now.
Stage 4: Data Management
Making a production-ready big data system involves handling not only hardware and software failures, which are similar to any standard system operations, but also data-related problems. If your company is really data-driven, you need to make sure your data is complete, correct, on time, and ready for evolution. What do all of these mean?
- You need to know that your data is not lost at any step of the pipeline, so you monitor the amount of data each program processes so that any anomaly is detected as soon as possible;
- You need tests for data quality so that you are alerted if any unexpected value shows up in the data;
- The programs’ running times should be monitored so that each data source has a predefined ETA and any delayed data source triggers an alert;
- Data lineage should be managed so that you know how each data source is generated and, when something goes wrong, what data and results are impacted;
- Legitimate schema changes should be handled automatically by the system, while illegal schema changes should be reported immediately;
- You need to version your programs and relate them to the data so that when a program changes, you know how the related data changes accordingly.
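The volume-monitoring idea from the first point can be sketched as a simple anomaly check against recent history (the helper name and tolerance are illustrative; real pipelines often use more robust statistics):

```python
def volume_anomaly(row_counts, latest, tolerance=0.5):
    """Flag the latest batch if its row count deviates too far from the recent average."""
    if not row_counts:
        return False  # no history yet, nothing to compare against
    avg = sum(row_counts) / len(row_counts)
    return abs(latest - avg) > tolerance * avg

# Row counts from the last four runs of a pipeline step.
history = [1000, 1100, 950, 1050]
print(volume_anomaly(history, 400))   # → True: data was likely lost upstream
print(volume_anomaly(history, 1020))  # → False: within the normal range
```

Attaching a check like this to every step of the pipeline is what turns “the data finally flows” into “we know the data that flows is complete.”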
Also, at this stage, you probably need a separate test environment for data scientists to test their code. It is also important to provide tools for them to test ideas quickly and safely and then deploy to production easily.
Stage 5: Security
Now you are getting serious about big data: your customer-facing products are driven by data, and your company management relies on real-time business reports to make big decisions. Are you sure your most important asset is secure and accessible only by the right personnel? What are your authentication and authorization schemes?
One simple example is Kerberos authentication in Hadoop. If you don’t run Hadoop with Kerberos integration, anyone with root access to any host in your cluster can impersonate the Hadoop root user and access all the data. Other tools like Kafka and Spark also need Kerberos for their authentication. Since setting these systems up with Kerberos is really complex (usually only commercial versions provide the support), most of the systems we have seen simply skip the Kerberos integration.
Besides authentication, below are a few other problems you need to deal with at this stage:
● auditing: the system has to audit all operations in the system, e.g., who accessed what;
● multi-tenancy: the system has to support multiple users and groups sharing the same cluster with resource isolation and access control; they should be able to process and share their data securely and safely;
● end-to-end security: all tools in the system have to implement correct security measures, e.g., Kerberos integration for all Hadoop-related components and HTTPS/SSL for all network traffic;
● single sign-on: all users in the system should have a single identity across all the tools, which is important for enforcing the security policy.
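The auditing item boils down to recording who did what to which resource, and when. A minimal sketch (a real system would write to append-only, tamper-resistant storage rather than an in-memory list):

```python
import datetime

audit_log = []

def audited(user, action, resource):
    """Record one audit event: who performed which action on which resource."""
    audit_log.append({
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
    })

# Hypothetical accesses in a shared cluster.
audited("alice", "read", "hdfs://warehouse/sales/2018")
audited("bob", "drop", "hive://default.users")
print(len(audit_log), audit_log[-1]["user"])  # → 2 bob
```

The hard part in practice is not the logging itself but making every entry point (HDFS, Hive, Kafka, the APIs) emit these events consistently under a single user identity, which is why auditing and single sign-on usually go together.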
Since most open source tools do not provide these features in their free versions, it is not surprising that many projects take a “try my luck” approach. We agree that security may not be equally valuable to everyone, but people should realize the potential problems and take appropriate measures.
Stage 6: Cloud
With your business growing, more and more applications are added to the big data system. Besides traditional big data systems like Hadoop/Hive/Spark, you now need to run deep learning pipelines with TensorFlow, time series analysis with InfluxDB, real-time data processing with Heron, or some Tomcat programs that provide data service APIs. Every time you need to run a new program, you find that provisioning machines and setting up the production deployment is really painful. And at times you need to temporarily allocate some machines for extra analytics work, e.g., a POC or a one-time training job on a big data set.
These problems are why you need to run your big data system on a cloud infrastructure. Cloud platforms like Mesos provide great support for analytics workloads as well as general workloads, along with all the benefits of cloud computing: easy provisioning and deployment, elastic scaling, resource isolation, high resource utilization, high resilience, and automatic recovery.
Another reason to run the big data system in the cloud is the evolution of the big data tools themselves. Traditional distributed systems like MySQL clusters, Hadoop, and MongoDB clusters tend to handle their own resource management and distributed coordination. With distributed resource managers and schedulers like Mesos/YARN, more and more distributed systems (like Spark) rely on the underlying cloud framework to provide these distributed operating primitives. Running them in a unified framework greatly lowers complexity and improves operational efficiency.
We have seen big data operations in each of the six stages. After more than 10 years of Hadoop adoption, most of the projects we have seen are stuck in stage 1 or 2. The problem is that implementing a system at stage 3 requires a lot of know-how and substantial investment. Google research shows that only 5% of the time spent building a machine learning system goes to the actual machine learning code; the other 95% goes to setting up the right infrastructure. Since data engineers are notoriously expensive to hire and hard to train (a good understanding of distributed systems is required), most companies are out of luck.
Like DevOps, DataOps is a continuous process that requires the right tools and the right mindset. The goal of DataOps is to make it easier to implement a big data project the right way and thus deliver the most value from the data with less effort. Companies like Facebook and Twitter have long been driving DataOps-like practices internally. However, their approaches are usually tied to their internal tooling and existing systems, and thus are hard to generalize for others.
In the past few years, technologies like Mesos and Docker have made standardizing big data operations possible. Combined with the much broader adoption of a data-driven culture, DataOps is finally ready to take the big stage. We believe this movement will lower the barrier to big data and make it much more accessible and successful for everyone.