Is Continuous Integration compatible with database applications ?
Continuous integration and continuous delivery offers huge efficiency gains for companies but is continuous integration even possible when the application’s backbone is a massive relational database. How can one spin up database copies for developers, QA, integration testing, and delivery testing ? Its not like Chef or Puppet can spin up a 10TB database copy in a few minutes the way one can spin up a Linux VM.
There is a way and that way is called data virtualization which allows one to spin up that 10TB database in minutes as well as branch a copy of that 10TB from Dev to QA, or for that matter branch several copies and all for a very little storage.
Old methods of application project development and rollout have a solid history of failure to meet deadlines and budget goals.
Repeating the old methods and expecting different results is what Einstein would call insanity
“Insanity: doing the same thing over and over again and expecting different results.” – Einstein
Continuation Integration (CI), Continuous Delivery and Agile offer an opportunity to hit deadlines on budget with tremendous gains in efficiency for companies as opposed to waterfall methods. With waterfall methods we try to get all the requirements, specifications and architecture designed up front and then set the development teams working and then tackle integration and deployment near the end of the cycle. It’s impossible to be able to precisely target dates when the project will be completed and sufficiently QA’ed. Even worse during integration problems and bugs start to pour in further exacerbating the problems of meeting release dates on time.
Agile, CI and CD fix these issues, but there is one huge hurdle for most shops and that hurdle is getting the right data, especially when the data is large, into the Agile, CI and CD life cycle and flowing through that lifecycle.
With Agile, Continuous Integration and Continuous Delivery we are constantly getting feedback on where were are, how fast we are going and we are altering our course. Our course is also open to changing as new information comes in on customer requirements.
Agile development calls for short sprints, among other things, for adding features and functions and with continuous integration those features can be tested daily or multiple times a day. Further enhancing continuous integration is continuos delivery, for those systems that make sense, where new code that has passed continuous integration can be rolled into continuous delivery meaning testing the code deployment in a test environment. For some shops, where it makes sense such as famously flickr, Facebook, Google the code can be passed into continuous deployment into production.
By using agile programming methods and constantly doing integration testing one can get constant feedback, do course correction, reduce technical debt and stay on top of bugs.
Compare the two approaches in the graphs below.
In the first graph we kick off the projects. With waterfall we proceed until we are near completion and then start integration and delivery testing. At this point we come to realize how far from the mark we are. We hurriedly try to get back to the goal, but time has run out. Either we release with less than the targeted functionality or worse the wrong functionality or we miss the
With Agile and CI its much easier to course correct with small iterations and the flexibility to modify the designs based on incoming customer and market requirements.
With Agile and CI, code is tested in an integrated manner as soon as the first sprint is done so bugs are caught early and kept in check. With waterfall, since it takes so much longer to get working set of code working and integration isn’t even started until near the end of the cycle, bugs start to accumulate significantly towards the end of the cycle.
In waterfall, deployment doesn’t even happen until the end of the cycle because there isn’t an integrated deployable set of code until the end. The larger the project gets and the more time goes by the more expensive and difficult the deployment is. With agile and CI the code is constantly deployable and so the cost of deployment stays constant at a low cost.
A waterfall project can’t even start to bring in revenue until it’s completely finished but with agile, there is usable code early on and with continuous deployment that code can be leveraged for revenue early.
With all these benefits, more and more shops are moving towards continuos integration and continuos. With tools like Jenkins, Team City, Travis to run continuos integration test and virtualization technologies such as VMware, AWS, Openstack, Vagrant, Docker and tools like Chef, Puppet and Ansible to run the setup and configuration many shops have moved closer and closer to continuos integration and delivery.
But there is one huge road block.
Gene Kim lays out the top 2 bottlenecks in IT
- Provisioning environments for development
- Setting up test and QA environments
and goes on to say
One of the most powerful things that organizations can do is to enable development and testing to get environment they need when they need it.
Having worked with many enterprise organisations on their DevOps initiatives, the biggest pain point and source of wastage that we see across the software development lifecycle is around environment provisioning, access and management.
From an article published today in Computing [UK] we hear the problem voiced:
“From day one our first goal was to have more testing around the system, then it moves on from testing to continuous delivery,”
But to achieve this, while at same time maintaining the integrity of datasets, required a major change in the way Lear’s team managed its data.
“It’s a big part of the system, the database, and we wanted developers to self-serve and base their own development in their own controlled area,” he says.
Lear was determined to speed up this process, and began looking for a solution – although he wasn’t really sure whether such a thing actually existed.
This road block as been voiced by experts in the industry more and more as the industry moves towards continuos integration.
When performing acceptance testing or capacity testing (or even, sometimes, unit testing), the default option for many teams is to take a dump of the production data. This is problematic for many reasons (not least the size of the dataset),
Humble, Jez; Farley, David (2010-07-27). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Addison-Wesley Signature Series (Fowler)) (Kindle Locations 7285-7287). Pearson Education. Kindle Edition.
What can we do about this enormous obstacle to Continous Integration of providing environments that rely on databases and these databases are too big and complex to provide copies for development, QA and continuos integration?
Fortunately for us there is data virtualization technology. As virtual machine technology opened the door to continuos integration, data virtualization swings it wide open for enterprise level application development that depend on large databases.
Data virtualization is an architecture (that can be encapsulated in software as Delphix has done) which connects to source data or database , take an initial copy and then and forever collects only the changes from the source (like EMC SRDF, Netapp SMO, Oracle Standby database). The data is saved on storage that has either snapshot capabilities (as in Netapp & ZFS or software like Delphix that maps a snapshot filesystem onto any storage even JBODs). The data is managed as a timeline on the snapshot storage. For example Delphix saves by default 30 days of changes. Changes older than the 30 days are purged out, meaning that a copy can be made down to the second anywhere within this 30 day time window. (some other technologies that address part of the virtual data stack are Oracle’s Snap Clone and Actifio).
Virtual data improves businesses’ bottom line by eliminating the enormous infrastructure, bureaucracy and time drag that it takes to provision databases and data for business intelligence groups and development environments. Development environments and business intelligence groups depend on having a copies of production data and databases and data virtualization allows provisioning in a few minutes with almost no storage overhead by sharing duplicate blocks among all the copies.
As a side note, but important, development and QA often require that data be masked to hide sensitive information such as credit cards or patient records, thus its important that a solution come integrated with masking technology. Data virtualization combined with masking can vastly reduce the surface area (amount of potentially exposed data) required to secure but eliminating full copies. Also data virtualization structure includes chain of authority where who had access to what data at what time is recorded.
The typical architecture before data virtualization looks like the following where a production database is copied to
In development the copies are further propagated to QA, UAT, but because of the difficulty in provisioning these environments, which takes number teams and people (DBA, storage, system, network, backup) , the environments are limited due to resource constraints and often the data is old and unrepresentative.
With data virtualization, there is a time flow of data states stored efficiently on the “data virtualization appliance” and provisioning a copy only takes a few minutes, little storage and can be run by a single administrator or even run as self service by the end users such as developers, QA and business analysts.
With the ease of provisioning large data environments quickly easily and for low resources, it become easy to quickly provision copies of environments for development and to branch in minutes those copies into multiple parallel QA lanes to enable continuous integration:
The duplicate data blocks, which is the majority in the case of large data sets, can even be shared across development versions:
to Continuous Integration