Ad Hoc Data Wrangling vs Production Data Engineering and Integration

Posted by Todd Goldman

Self-service data analytics and data wrangling have been all the rage for the past few years. The idea that "citizen data scientists" and "citizen data analysts", given the right tools, can be freed from the shackles of IT to run their own data analysis projects is very attractive. They can access data in a data lake and use ad hoc data integration and analytics tools to manipulate that data in a spreadsheet-like interface that does smart things like normalizing data formats, proposing join conditions, and making all kinds of other smart recommendations to help the user figure out how to integrate data in new and interesting ways. But once they have the "recipe" they want to apply to integrate the data, putting that recipe into a repeatable data pipeline is an exercise left to the user.

This is where production data engineering and integration come into play. For ad hoc analytics, you can use a sample data set. For production, you have to ingest the entire data set, which means you have to parallelize the loading of the data so it fits within an SLA window. For ad hoc analytics, you can load the data once; for production analytics, you have to handle incremental loading on an ongoing basis as new data comes into the system. That means dealing with change data capture and slowly changing dimensions. These are non-trivial technical problems in the Hadoop world, because merging and syncing new data, as well as tracking history, is not something Hadoop currently handles for you.
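
To make that concrete, here is a minimal sketch of what an incremental merge looks like when the storage layer cannot update records in place. It assumes a PySpark environment; the table paths and the customer_id / updated_at columns are placeholders, not any particular product's API.

```python
# A minimal sketch of an incremental "merge" on a data lake with no in-place
# updates. Paths and the customer_id / updated_at columns are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("incremental_load").getOrCreate()

existing = spark.read.parquet("/lake/customers")        # current snapshot
changes = spark.read.parquet("/staging/customers_cdc")  # new/changed rows captured from the source

# Keep only the latest version of each key across the old snapshot and the new changes.
latest = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
merged = (existing.unionByName(changes)
          .withColumn("rn", F.row_number().over(latest))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# HDFS files are immutable, so the whole snapshot is rewritten to a new
# location and swapped in afterwards.
merged.write.mode("overwrite").parquet("/lake/customers_v2")
```

Notice that the entire snapshot has to be rewritten just to apply a handful of changed rows, which is exactly why change data capture and history tracking are hard problems in this world.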

For ad hoc analysis, you don't have to worry about how your data pipeline may interact with other data pipelines. Who cares if you load a data set from a relational database into the data lake, and happen to do it several times? But in a production environment, if you have five data pipelines that all load the same table into the data lake, your operational environment had better be smart enough to realize it is loading the same table five times and either load it once or load only the most recent changes.
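
One way an operational environment might avoid the five-loads-of-one-table problem is a shared load registry that every pipeline consults before ingesting. The sketch below is illustrative only; the registry file location, field names, and watermark format are assumptions, not a reference to any specific tool.

```python
# A hypothetical "load registry" shared by all pipelines, so the same source
# table is not ingested five times over.
import json
import os
import time

REGISTRY = "/lake/_meta/load_registry.json"

def _read_registry():
    if os.path.exists(REGISTRY):
        with open(REGISTRY) as f:
            return json.load(f)
    return {}

def should_load(source, table, max_age_seconds=3600):
    """Only load if this source table has not been ingested recently."""
    entry = _read_registry().get(f"{source}.{table}")
    return entry is None or time.time() - entry["loaded_at"] > max_age_seconds

def record_load(source, table, watermark):
    """Remember when the table was loaded and the high-water mark reached."""
    registry = _read_registry()
    registry[f"{source}.{table}"] = {"loaded_at": time.time(), "watermark": watermark}
    os.makedirs(os.path.dirname(REGISTRY), exist_ok=True)
    with open(REGISTRY, "w") as f:
        json.dump(registry, f)

# Each pipeline checks before ingesting, instead of blindly reloading:
if should_load("crm_db", "customers"):
    # ... run the actual ingest here, then record the new watermark ...
    record_load("crm_db", "customers", watermark="latest_change_id_or_timestamp")
```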

For ad hoc analytics, you don't worry about restarting the data pipeline if it fails; you just restart it manually. In production, your system had better be smart enough to know that something failed, try to restart it, and inform the right people that it failed in the first place.
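
As a rough illustration, that "restart and inform the right people" behavior boils down to something like the following. The run_pipeline and notify functions are hypothetical stand-ins for whatever scheduler and alerting channel you actually use.

```python
# A minimal sketch of "retry, then alert" for a production pipeline run.
import logging
import time

log = logging.getLogger("pipeline")

def notify(message):
    # Placeholder: in production this might page an on-call engineer
    # or post to a chat channel.
    log.error(message)

def run_with_retries(run_pipeline, max_attempts=3, backoff_seconds=60):
    """Run the pipeline, retrying on failure and alerting if all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_pipeline()
            return True
        except Exception as exc:
            log.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    notify("Pipeline failed after %d attempts" % max_attempts)
    return False
```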

And lastly, when you are doing ad hoc data analysis, you don't need to worry about tracking who made changes to the analytic recipe. In general, one data scientist works on a problem and, once they find the recipe, publishes it to be put into production. Once you switch to a production environment, however, you want to know how often the pipeline was run, how long each run took, and who made changes to the pipeline logic, if any were made. Governance of the process becomes a big issue in an operational environment.
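
For a sense of what that governance tracking involves, here is a hypothetical sketch of the run metadata an operational environment would capture for every execution: who started it, when, how long it took, which version of the pipeline logic ran, and whether it succeeded. The audit file path and field names are illustrative assumptions.

```python
# A hypothetical audit record written for every pipeline run.
import getpass
import json
import time
import uuid

def audited_run(pipeline_name, pipeline_version, run_pipeline,
                audit_path="/lake/_meta/audit.log"):
    record = {
        "run_id": str(uuid.uuid4()),
        "pipeline": pipeline_name,
        "version": pipeline_version,   # e.g. a commit id of the pipeline logic
        "started_by": getpass.getuser(),
        "started_at": time.time(),
    }
    try:
        run_pipeline()
        record["status"] = "success"
    except Exception as exc:
        record["status"] = f"failed: {exc}"
        raise
    finally:
        record["duration_seconds"] = time.time() - record["started_at"]
        with open(audit_path, "a") as f:
            f.write(json.dumps(record) + "\n")
```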

The bottom line is that the tools designed for ad hoc self-service data analytics are NOT the same tools you want for running that environment operationally at scale, and vice versa. They are different problems with very different feature set priorities. While there is some overlap, in that both let you ingest and transform data, the similarities tend to stop there. So make sure you think through your needs and requirements for both environments, because those requirements are significantly different.
