In the last blog post, we discussed why there is a lot of experimentation in the big data world, and also why most big data experiments never make it into production. This was famously noted in a late 2016 Gartner press release, which stated, “Only 15 percent of businesses reported deploying their big data project to production.”
In this blog post, we will walk through what you can do to use big data automation to overcome the top 5 technical challenges that block organizations from fully taking advantage of big data. The good news is that you no longer need an army of experts to make this all work. There has been a ton of investment in the space to automate away the complexity, and make it possible to build end-to-end big data pipelines with little to no big data or Hadoop expertise.
If you haven’t read the previous blog, there is no need to go back and read it. I have included the problem and the response all together in this post. Here are our top 5 technical reasons that big data projects don’t make it into production, and what you can do about them:
Challenge 1: Can’t load data fast enough to meet SLAs.
While tools like Sqoop support parallelization for data ingest to get data from legacy sources into a data lake, you need an expert to make it work. How do you partition the data? Do you need to run 10 mappers or 20? How do you know? If you can’t properly parallelize the ingest of data, ingestion tasks that could be done in an hour can take 10 to 20 times longer. The problem is that most people don’t know how to tune this properly.
- Solution: First of all, don’t hand code a solution here. Many vendors tackle this problem on top of Hadoop, so you don’t have to write a bunch of code to solve the data ingest problem. While I of course think that Infoworks is the best, you really do have options here, so under no circumstances should you be hand coding.
- So when you are evaluating vendors, make sure to consider how much automation they provide to minimize or eliminate hand coding altogether. Do they connect to all of the data sources you care about? If they don’t automate the entire process, how difficult are the areas they don’t automate? Does the vendor provide fast, parallel, native path access to your sources, or does it go through JDBC? You also want to watch out for pretty user interfaces that look like they don’t require coding but, when you double click on an icon, reveal a bunch of ugly code underneath that you will have to write yourself.
- Of course, Infoworks is one of the vendors that addresses this issue with full automation, and no coding required… and as you are reading our blog, I will leave it as an exercise for the reader to find our competitors in this space.
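To make the tuning question concrete, here is a minimal Python sketch of the kind of decision an ingest tool has to make for you: carving a numeric key range into one slice per parallel reader. The function name `plan_splits` and its parameters are hypothetical, for illustration only, and this is not how any particular vendor implements it. It is roughly the min/max range split that a tool like Sqoop performs on a split column.

```python
from math import ceil


def plan_splits(min_key: int, max_key: int, num_mappers: int) -> list[tuple[int, int]]:
    """Split the inclusive key range [min_key, max_key] into contiguous,
    roughly equal inclusive ranges, at most one per parallel reader.

    Hypothetical helper for illustration; assumes an integer split column.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must be >= min_key")

    total_keys = max_key - min_key + 1
    # Never plan more readers than there are keys to read.
    num_mappers = min(num_mappers, total_keys)
    step = ceil(total_keys / num_mappers)

    splits = []
    lo = min_key
    while lo <= max_key:
        hi = min(lo + step - 1, max_key)
        splits.append((lo, hi))
        lo = hi + 1
    return splits
```

Even splits only balance the work when keys are evenly distributed. With skewed keys, one reader ends up with most of the rows, which is exactly why choosing the partitioning column and the mapper count takes either expertise or automation.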
Challenge 2: Can’t incrementally load data to meet SLAs.
Most organizations aren’t moving their entire operations onto a big data environment. They move data there from existing operational systems to perform new kinds of analysis or machine learning. This means that they need to keep loading new data as it arrives. The problem is that these big data environments don’t support the concept of updates, deletes or inserts. This means you have to reload the entire data set again (see Challenge 1 above) or you have to code your way around this classic change data capture problem.
- Solution: Once again, the solution is to automate the process, and many of the vendors who automate the ingest problem help with this process as well. Note that you need to be able to deal with two challenges here. The first is change data capture on the source. You have to have a way to identify that new rows or columns have been added to the source system and then move just those changed rows or columns into your data lake. The second challenge is handling the merge and synching of the new data into the target big data system, which, once again, doesn’t support the concepts of updates, deletes or inserts. That means that whichever ingest vendor you choose had better take care of this issue for you as well. Note that as of this writing there are some new open source technologies that better support these concepts, but also as of this writing, they are not very mature and not very good.
- With that as background, once again, Infoworks addresses the incremental ingestion of data and fully automates it. In fact, the only effort required of you to configure Infoworks is choosing the approach you want to take to monitor changing data on the source. For this, we give you a simple pull-down menu and a single check box to select. In addition, we don’t only monitor for new rows being added. If a column is added to a table on the source you are ingesting, we will detect that as well, automatically add that new column to the ingest process and merge it properly into the data lake.
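The two halves of the incremental problem can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: rows are dictionaries with an `id` key, an `updated_at` timestamp and an optional `deleted` flag, and the function names are hypothetical. Real change data capture (log-based or query-based) and a real merge on a distributed file system are considerably more involved.

```python
from datetime import datetime


def extract_changes(source_rows: list[dict], last_watermark: datetime) -> list[dict]:
    """Change capture by timestamp: keep only rows modified since the last load.

    Assumes every source row carries an `updated_at` column (an assumption).
    """
    return [row for row in source_rows if row["updated_at"] > last_watermark]


def merge_changes(target: dict, changes: list[dict], key: str = "id") -> dict:
    """Merge changed rows into a snapshot of the target table keyed by `key`.

    New keys are inserted, existing keys are overwritten, and rows flagged
    `deleted` are dropped. Returns a new dict; the input snapshot is untouched,
    mirroring the rewrite-rather-than-update style of immutable storage.
    """
    merged = dict(target)
    for row in changes:
        if row.get("deleted", False):
            merged.pop(row[key], None)
        else:
            merged[row[key]] = row
    return merged


def detect_new_columns(source_columns: list[str], target_columns: list[str]) -> list[str]:
    """Schema drift check: columns on the source that the lake table lacks."""
    known = set(target_columns)
    return [col for col in source_columns if col not in known]
```

After each successful load, the watermark would be advanced to the largest `updated_at` value seen, so the next run picks up only what changed since then.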
Challenge 3: Can’t provide reporting access to data interactively.
Imagine you have 1000 BI analysts, and none of them want to use your data models because they take too long to query. Actually, you only need one data analyst to make this unbearable. This is a classic problem with Hadoop and is the reason why lots of companies only use Hadoop for preprocessing and applying specific machine learning algorithms but then move the final data set back to a traditional data warehouse for use by a BI tool. Regardless, this adds yet one more step in the process that gets in the way of successfully completing a big data project.
- Solution: Once again, there are lots of companies that provide solutions that can take files in HDFS or Hive and generate OLAP cubes that can then be accessed from visualization tools like Tableau via JDBC/ODBC. All of these solutions operate on the same basic principle: they pre-calculate the cube and then leverage the distributed computing power of the cluster to present the OLAP cube to a BI visualization layer. The details may differ, but the basic concepts are all the same, and they turn a Hadoop based environment, which isn’t known to be very good for interactive queries, into an environment that can be used for interactive queries. And of course, this is yet another area that Infoworks also automates.
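The pre-calculation idea is easy to see at toy scale. The Python sketch below is an illustration under my own assumptions (in-memory rows, a single summed measure, a hypothetical `build_cube` function), not any vendor's engine: it aggregates once for every combination of dimensions, so that a later query becomes a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows: list[dict], dimensions: tuple[str, ...], measure: str) -> dict:
    """Pre-aggregate the sum of `measure` for every subset of `dimensions`.

    Returns {dims: {values: total}}, where `dims` is a tuple of dimension
    names and `values` is the matching tuple of dimension values.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Hypothetical sample data, for illustration only.
sales = [
    {"region": "east", "product": "a", "sales": 10.0},
    {"region": "east", "product": "b", "sales": 5.0},
    {"region": "west", "product": "a", "sales": 7.0},
]
cube = build_cube(sales, ("region", "product"), "sales")

# An interactive query is now a dictionary lookup, not a scan of the raw rows.
east_total = cube[("region",)][("east",)]
```

The number of aggregates doubles with every added dimension, which is why real products lean on the cluster to build and serve the cube.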
Challenge 4: Can’t migrate from test to production.
Many organizations have been able to identify the potential for new insights from data scientists working within their sandbox environments. Once they have identified a new “recipe” for analytics, they need to move from an individual data scientist running this analysis in a sandbox to a production environment that can run every day. Moving from dev to production is a complete lift and shift operation that is generally done manually. And while it ran just fine on the dev cluster, now that same data pipeline has to be re-optimized on the production cluster. This tuning can often require significant rework to get it to perform efficiently. This is especially true if the dev environment is in any way different from the production environment.
- Solution: The challenge here is that, in this case, there isn’t a long list of vendors who actually tackle this problem. There are a lot of “data prep” applications out there that are great for data scientists who are basically mining the data and prototyping potential “recipes” that could be used for decision making. But once the data scientists discover these recipes, these tools leave it as an exercise for the user to convert the query, analytic or machine learning algorithm into a repeatable process that can be run continuously at scale.
- The obvious answer to this challenge, once again, is automation. This is another case where Infoworks automates the process, here promoting a project from dev to test to production. Along the way, Infoworks automatically adjusts and optimizes your data pipelines to take advantage of the size of the production cluster. No recoding or reimplementation of the pipeline is required. This means that the same self-service that data scientists take advantage of for data discovery can be delivered all the way through to the push to full production.
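One way to picture promotion without recoding is to keep the pipeline logic fixed and derive the environment-specific settings from the cluster you are deploying to. The Python sketch below is a hypothetical illustration: the `Cluster`, `PipelineConfig` and `promote` names, and the rule of reserving one core per node, are my assumptions and not a description of how any product, Infoworks included, does it internally.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cluster:
    name: str
    worker_nodes: int
    cores_per_node: int


@dataclass(frozen=True)
class PipelineConfig:
    name: str
    steps: tuple[str, ...]  # the pipeline logic, identical in every environment
    parallelism: int        # environment-specific tuning knob


def promote(config: PipelineConfig, target: Cluster,
            reserved_cores_per_node: int = 1) -> PipelineConfig:
    """Return a copy of the pipeline re-tuned for `target`.

    Only the tuning knob changes; the steps are carried over untouched.
    The sizing rule here is a deliberately naive assumption.
    """
    usable_cores = max(1, target.cores_per_node - reserved_cores_per_node)
    return replace(config, parallelism=target.worker_nodes * usable_cores)


# Hypothetical environments and pipeline, for illustration only.
prod = Cluster(name="prod", worker_nodes=40, cores_per_node=16)
dev_pipeline = PipelineConfig(name="churn", steps=("ingest", "join", "score"), parallelism=2)
prod_pipeline = promote(dev_pipeline, prod)
```

Real tuning also has to account for memory, data volumes and competing workloads, which is where the manual rework described above usually comes from.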
Challenge 5: Can’t manage end to end production workloads.
Most organizations have focused on tooling up so their data analysts and scientists can more easily identify new insights. They have not, however, invested in similar tooling for running data workflows in production, where you have to worry about starting, pausing and restarting jobs. You also have to worry about ensuring fault tolerance of your jobs, handling notifications, and orchestrating multiple workflows to avoid “collisions”.
- Solution: Here you could attempt to use Cloudera Navigator or Apache Atlas. They do some minimal tracking of data pipelines. But they really only report on lineage and don’t do anything to optimize your workloads for you. If Pipeline A depends on Pipeline B finishing before it can complete its run, this is something that you would have to figure out yourself. Navigator and Atlas won’t do it for you. The alternative is usually hand-coding scripts and manually checking dependencies. Another possibility is running traditional enterprise scheduling tools, which provide basic orchestration but are not built for managing hundreds of pipelines with different SLAs and dependencies. At the end of the day, you are either going to manage the pipelines mostly manually, using some of these tools as visual aids, or you will write a bunch of code yourself.
- Fortunately, Infoworks addresses this problem as well, providing a distributed orchestrator that monitors production workloads and makes them fault tolerant, reducing the load on system and production administrators.
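To show what “figuring out dependencies yourself” looks like, here is a minimal Python sketch of the two basics an orchestrator has to cover: ordering pipelines by their dependencies and retrying a failed job. It uses the standard library `graphlib` module (Python 3.9 or later). The `run_order` and `run_with_retries` names are hypothetical, and a production orchestrator handles far more (SLAs, pausing, notifications, distributed state).

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group pipelines into batches that respect their dependencies.

    `dependencies` maps each pipeline to the set of pipelines that must
    finish first. Pipelines within one batch do not depend on each other,
    so they could run in parallel. Raises CycleError on circular dependencies.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


def run_with_retries(job, max_attempts: int = 3):
    """Call `job()` until it succeeds or `max_attempts` is exhausted.

    A very basic form of fault tolerance: the last failure is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise


# Pipeline A needs Pipeline B; B and C both need the ingest step.
batches = run_order({"A": {"B"}, "B": {"ingest"}, "C": {"ingest"}})
```

Even this toy version has to be kept in sync by hand every time a pipeline is added or a dependency changes, which is the maintenance burden that grows with hundreds of pipelines.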
The Bottom Line
The bottom line is that you don’t need nearly as much expertise as was required 5 years ago when Hadoop first started to get big. A first wave of automation came into existence about 3 years ago and automated individual slices of the big data pipeline from ingest to consumption. Infoworks represents a second wave that doesn’t just automate an individual slice, but automates the entire end-to-end data pipeline in a fully integrated fashion.
Regardless of whether you go with the first wave of automation, or what is now appearing as a second wave, you should not have to hand-code any of your big data pipelines, either in development or in production. So if you find the majority of your big data effort turning into a coding effort in Python, Pig, Hive, Scala, etc., you are doing something wrong. Tools and platforms are now available so that your existing data and business analysts can achieve a relatively high level of self-service without having to become big data experts.