Archive

Archive for September, 2013

Lab128 : secret DBA performance tool

September 12th, 2013


 


I’ve mentioned Lab128 before but I have to mention it again. Every DBA should know about this tool. It’s the fastest way to connect to an Oracle database and get hard-core performance information, nicely visualized.

Lab128 starts up in seconds, like 1 or 2 seconds. It’s super fast and lightweight but powerful. My older version took only 10 MB of RAM. The newest version is about 47 MB but still as fast as ever.

Lab128 can be used to monitor 15-20 instances. It keeps 13-14 hours of performance data in memory and the last 60 days on disk. Monitoring that many instances for that long takes around 800-900 MB of RAM and well under 10% of the CPU on a laptop.

Below is my favorite screen, the “Activity Explorer” page, which is somewhat like the Top Activity screen in OEM but shows both the ASH data and the wait event data, so you can compare the two. The ASH data can be collected by the tool itself, avoiding the Diagnostics Pack license, or if you have the license the tool can read ASH from the database views.
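Tools can build this kind of ASH-style history themselves by simply polling v$session every second or so instead of reading the licensed ASH views. A minimal sketch of such a poll (my illustration, not Lab128's actual query):

-- run every second and store the rows with a timestamp to build an ASH-like history
select sysdate sample_time, sid, serial#, username, sql_id,
       event, wait_class, state, blocking_session
  from v$session
 where status = 'ACTIVE'
   and type = 'USER'
   and (wait_class != 'Idle' or state != 'WAITING');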

You can monitor several databases at a time.

Here is the top session screen showing session state, session types and, for active queries, the estimated completion % for the query.

Cool marriage of Unix top and v$session:

Lab128 Top Processes window

buffer cache explorer

Lab128 Buffer Explorer window

lock monitor

Lab128 Locks and Locked Objects window

  • Alert File Viewer
  • 10053 trace viewer
  • Clusterware monitor

Much more.

Check out the website at http://www.lab128.com

 

Uncategorized

Confio Ignite

September 11th, 2013


Catch Confio at Oaktable World Sept 23,24 in San Francisco

People have been asking me recently, “what is the best enterprise database monitoring software?”

Of course for Oracle there is OEM, but what if OEM doesn’t find the problem? What if OEM breaks? (I’ve blogged before on how OEM can break and all the king’s men don’t seem to be able to make it work again.) What if one wants to access the performance data but the database is down? (Grid Control only has an anemic, modified and transformed extract from AWR.) What if only DBAs have access to OEM and the developers want access to performance information, and not only access but safe access in a user-friendly and manager-friendly interface? (I’m a strong believer in giving OEM access to developers, but unfortunately the OEM interface is set up as a DBA interface with potentially risky access to database actions and is not a safe, read-only browsing interface for developers.) What if one wants to monitor multiple database platforms in the same interface?

In enterprise database monitors I look for a dashboard based on wait time and CPU statistics. I want to see CPU and wait statistics correlated in a way that lets me easily see the load on the database and the breakdown of that time by wait class and CPU. The main databases are instrumented with wait events: Oracle, Sybase, SQL Server and DB2 (DB2 wait interface support arrived in 9.7; before 9.7 the statistics were less useful).
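For Oracle, the kind of wait class and CPU breakdown I’m describing can be sketched straight from ASH (Diagnostics Pack required); this is only an illustration of the idea, not any vendor’s actual query:

-- approximate average active sessions per minute, split into CPU and wait classes
select to_char(sample_time,'HH24:MI') minute,
       decode(session_state,'ON CPU','CPU',wait_class) class,
       round(count(*)/60,2) avg_active_sessions
  from v$active_session_history
 group by to_char(sample_time,'HH24:MI'),
          decode(session_state,'ON CPU','CPU',wait_class)
 order by 1, 2;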

I only know of 3 cross platform enterprise database monitoring products that follow such a methodology:  Confio Ignite, Quest Performance Analyzer and Precise Indepth/I3.

Precise Indepth/I3 has fallen off my radar. I haven’t heard much from them, and the last time I talked to them they said that installing the product required a consultant. That’s a show stopper for me. Quest is falling off my radar as well, not for lack of technology but more for a lack of focus by the company. The product “Performance Analyzer” doesn’t even show up in the top 10 hits on Google. My guess is that they have rolled the product under the hood of Foglight and sell it as a Foglight option, which means more money and more complexity, which are all drawbacks. As far as technology goes, “Performance Analyzer” was pretty cool and had nice dashboards, but probably its biggest drawback was that it required binaries to be installed on each and every target (at least for Oracle), which can turn into a maintenance nightmare. Well, who else is out there? The other contender is Confio. One of Confio’s great advantages is that they do (at least for most of their history) only monitoring. There is clear focus and enthusiasm. It’s such a refreshing change from Quest and Precise (and, if you are Oracle only, from OEM).

So let’s look at Confio. One of the newest and most exciting things at Confio is the monitoring of VMware statistics. VMware has hundreds of statistics (I once ran esxtop in batch mode and got 24,000 columns, yes 24,000). The statistics are shown in nice graphs in ESX vSphere and vCenter, but the statistics in the graphs have to be chosen from lists and the number of lines in the graphs can become overwhelming. The worst part, though, is the lack of correlation between statistics of different types such as CPU, I/O and network, which are on different graphs. Finally, there has been no way to correlate the VMware statistics with the Oracle database, until now with Confio.

Here is a Confio enterprise dashboard where I can see my databases. The databases are grouped in this image by those on VMware and those not:

If I drill down onto a database running on VMware I get not only the classic load chart on the database but also correlated graphs from the OS  and VMware:

The top graph is the load on the Oracle database grouped by top SQL, based on active time in the database (which includes wait time and CPU). Below this graph are 3 other graphs:

  1.  DB Instance
    • Signal Waits Percent
    • Instance CPU Utilization
  2. VM/OS
    • VM CPU Usage
    • VM CPU Ready Time
    • O/S CPU Queue Length
  3. Physical Host
    • Host CPU Usage

Now with these graphs on the same page I can easily make correlations.  I can see a spike in my SQL load at noon.

I can correlate this spike in database load with the three graphs.

1. The “DB Instance” graph shows a spike in the CPU used by Oracle. It also shows “Signal Waits Percent”, which is a fancy way of saying the database is waiting for CPU, i.e. there is CPU contention.

2. The “VM/OS” graph shows CPU usage going up and “CPU Ready Time” going up. “CPU Ready Time” is an important statistic for VMware, yet it’s not well documented. It shows how much time the VM wanted CPU from the host but couldn’t get it.

3. The “Physical Host” graph shows that there was a spike in CPU used at the HOST level across all the VMs on that host.

Additionally there are event notifications of changes on the system such as adding a new VM to the host. Note  the grey circles with arrows. Pass your mouse over the event icon to get information about the event.

Grey circles are events on other VMs, blue circles are events on this VM.

I find the ability to see all the layers in one view, see the important statistics only and be able to correlate all these statistics invaluable.

On top of the VMware monitoring option seen above, Confio offers the classic view of database load through different aggregation groupings:

  • SQL
  • Waits
  • Programs
  • Machines
  • DB Users
  • O/S Users
  • Files
  • Plans
  • Objects
  • Modules
  • Actions

Clicking the Objects tab gives a different perspective.

The above charts are at a large granularity, but one can zoom down to intervals as small as 10 seconds:

Ignite also notifies when it finds performance issues:

Drilling down on alerts will point out such useful things as a SQL statement that has had an execution plan change for the worse:

Summary

The above are a few of my first impressions of Confio’s Ignite. Ignite seems to fill a clear need in the industry for enterprise cross database platform monitoring including the unique additional ability to monitor VMware.

If you are on Oracle only, then it is a cheaper alternative to OEM, and if you have OEM already then Ignite is a good complement. One attractive feature of Ignite is that all the data is collected into a centralized database, allowing one to easily run custom queries and query across multiple databases. Most importantly, Ignite gives safe access to database performance data to managers and developers, the people who should actually be seeing and understanding database performance.
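Since the repository is just another Oracle database, custom reporting is a plain SQL exercise. A purely hypothetical sketch (the table and column names are invented for illustration and are not Ignite’s actual repository schema):

-- hypothetical repository table: one row per monitored database, wait class and hour
select database_name,
       wait_class,
       sum(wait_seconds) total_wait_seconds
  from wait_history          -- invented name, for illustration only
 where sample_hour >= sysdate - 7
 group by database_name, wait_class
 order by total_wait_seconds desc;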

You can take Confio Ignite for a spin at:

http://demo.confio.com:8143/iwc

login as demo, demo

PS: please share any experiences you have had with the product in the comments. Thanks

 


Uncategorized

Method-R Tools

September 10th, 2013


It’s rare to find professional-level tools that are written by the people who use them. Most professional tools are written by teams of people who have little idea of how they are used, and many users write tools that never reach a professional level. Thus it is a pleasure to see the tools put out by Method R with Cary Millsap and Jeff Holt. Jeff and Cary are well known in the industry for their work and contributions to the world of Oracle performance tuning. With the Method R tools they not only share their know-how and the tools they use to tune systems, but they have polished them into solid, clean, tight professional-level tools that are dependable, high quality and written with integrity.

See: Pythian’s blog on Method-R tools

Method R Tools

  1. MR Trace ($49.95) – automatically creates, collects and manages trace files for queries run in Oracle’s  SQL Developer using extended tracefile options targeted at collecting pertinent performance information for optimizing SQL statements.
  2. MR Tools (starting at $397)

mrskew – your trace file profiler and data miner, the core MR Tools product, check out Karl Arao’s examples

other tools

  • mrls – trace file lister, examples, list trace files sorted by  LIO or sorted by rows returned
  • mrcallrm – trace file cropper – retaining required info and filtering undesired

Trace file timings:

  • mrtim – timestamp converter
  • mrtimfix – your trace file tim value fixer, fix 11gR1 bug in trace files & change timing to UNIX epoch
  • mrnl – pre-pend lines in trace with line #, time, duration and other options as well
  3. Method R Profiler (starting at $1,180)

The Profiler is the gold standard of the Method R tools. It takes all the know-how and analysis and wraps it up in a clean, crisp HTML UI that makes for easy browsing, exploration and analysis.

I look forward to blogging soon on some of the innovative and powerful features of Method R Profiler

 

A quick summary of things in the report

 

1. Task-Level Profiles
1.1 Profile by subroutine
   subroutines:
     events like:
      db file sequential read
      db file scattered read >
   spark line
   table , each line is a bucket between 1us - 1000 secs, orders of 10
      call count
      total seconds
2.2 next subroutine

3. Contribution to Subroutine by Statement
3.1 Statement Contributions to <>
     statement text (with sql hash)
       seconds duration
       count
       mean
       min
       max
     drill down to stats
3.2 next subroutine

4. Statement Detail
4.1 Statement Report for <>
   Oracle hash value:
   Full statement text: drilldown
   Re-use: # distinct text
   Text size: x lines, y bytes
   # of distinct plans: X
   Optimizer goals:
   response time contribution:
4.1.1 Profile by Contribution to Parent Cursor
4.1.2 Profile by Subroutine
     waits, CPU, time by
     seconds, count, mean, min, max, skew sparkline
4.1.3 Profile by Database Call
      for
        FETCH
        between calls
        EXEC
        PARSE
        cursor close
     show
       seconds
       CPU
        rows processed
        buffer cache access total
         CR
         CU
      PIOs
      Library cache misses
4.1.4 Execution Plans and Placeholder Values
      Execution plan
        seconds for row !
        seconds including descendents
        rows returned
        CR LIOs
        PIOs read
        PIOs writes
4.1.5 Statement Text
4.2 ... next statement ...
5. Configuration information

Uncategorized

Block Change Tracking setup and overhead

September 9th, 2013



For everything you ever wanted to know about Block Change Tracking (BCT) but were afraid to ask, check out the awesome presentation from Alex Gorbachev.

Here are some quick notes on BCT, mainly for my own reference. BCT can be enabled while the database is up. Turning on BCT requires giving a tracking file location, which can also be done while the database is up. Changing the BCT tracking file location after it has already been set requires bouncing the database. Finally, with BCT enabled we can look at the upper bound size for an incremental database backup. The actual size could be as large as the upper bound or as small as 1/4 of it, as BCT only tracks changes in groups of 4 blocks at a time.


select status from v$block_change_tracking;
-->  DISABLED
alter database enable block change tracking;
-->  ERROR at line 1:
-->  ORA-19773: must specify change tracking file name
alter system set db_create_file_dest='/oradata/orcl/' scope=both;
alter database enable block change tracking;
select * from v$block_change_tracking;
-->   STATUS     FILENAME                                                      BYTES
-->   ---------- -------------------------------------------------------- ----------
-->   ENABLED    /oradata/orcl/ORCL/changetracking/o1_mf_6ybbfdjv_.chg     11599872
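The tracking file can also be named explicitly instead of relying on db_create_file_dest (the path below is only an example), and BCT can be turned off again, which deletes the tracking file and loses the tracked history:

alter database enable block change tracking using file '/oradata/orcl/bct.chg';
alter database disable block change tracking;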
The following query is based on an original query from Alex Gorbachev and gives the number of blocks to read during a level 1 incremental backup. It seems this should also be an upper bound on the size of a level 1 incremental database backup. Alex’s original query was based on the reads incurred by one datafile; this query attempts to look at the whole database.
--  from Alex Gorbachev (I believe the 32 refers to 4 * 8K block size, so if your block size is different you'll have to change this)
SELECT (count(distinct b.fno||' '||bno) * 32)/1024 MB
  FROM x$krcbit b,
    (SELECT MIN(ver) min_ver, fno
       FROM
         (SELECT curr_vercnt ver,
                 curr_highscn high,
                 curr_lowscn low,
                 fno
          FROM x$krcfde
              UNION ALL
          SELECT
                  vercnt ver,
                  high,
                  low,
                  fno
          FROM x$krcfbh )
          WHERE (SELECT MAX(bd.checkpoint_change#)
                 FROM   v$backup_datafile bd
                 WHERE  bd.incremental_level <= 1) between low and high
          GROUP BY fno
       ) sub
  WHERE b.fno = sub.fno
    AND b.vercnt >= sub.min_ver
/
--> 960

Alex’s original query:


SELECT count(distinct bno) * 32
FROM x$krcbit b
WHERE b.fno = 7 AND b.vercnt >=
(SELECT MIN(ver) FROM
(SELECT curr_vercnt ver, curr_highscn high, curr_lowscn low
FROM x$krcfde WHERE fno = 7
UNION ALL
SELECT vercnt ver, high, low
FROM x$krcfbh WHERE fno = 7)
WHERE (SELECT MAX(bd.checkpoint_change#)
FROM v$backup_datafile bd
WHERE bd.file# = 7
AND bd.incremental_level <= 1) between low and high);


Running incremental backups for a while makes it possible to collect the historical ratio between the number of blocks read and the number and size of the backups (see the sketch after the list below); this would also account for compression.
Note that the query above is just an example and it has the following limitations:

  • Chunk size is hard coded to 32K (could it vary on different platforms?)
  • First block overhead is not accounted for
  • No special case when required bitmap version is not available (purged) and the whole datafile must be read
  • No case with backup optimization for level 0 (v$datafile_backup.used_optimization)
  • No case when no data blocks in datafile is changed (no bitmap version but the first block must be backed up anyway)
  • Only single datafile
  • No accounting for unavailable base incremental backup
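As a hedged sketch of how that history could be collected, v$backup_datafile already records, per level 1 backup, how many blocks were read versus the datafile size and whether change tracking was used (adjust as needed):

select file#,
       completion_time,
       used_change_tracking,
       blocks_read,
       datafile_blocks,
       round(100 * blocks_read / nullif(datafile_blocks, 0), 1) pct_of_file_read
from v$backup_datafile
where incremental_level = 1
order by completion_time desc;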


I’ve been wondering what kind of impact BCT has on databases and had the opportunity to talk to Vinay Srihari.*

Kyle: Vinay, what kind of impact does BCT have on the performance of an Oracle database?

Vinay: After it was introduced in 10gR1, BCT was put through TPCC-style OLTP benchmarks internally at Oracle. There was no measurable performance overhead on OLTP. However, it took a release to stabilize the feature and fix several fundamental bugs: 10.2.0.4 is all right, not so sure about 10.1.

There was some concern because BCT updates to the in-memory bitmap are done in the critical redo generation path by every foreground. The BCT logging turned out to be negligible in terms of CPU (instruction count) because redo generation was an order of magnitude more work. The main issue was bugs that caused the BCT background (CTWR) to get stuck or slow down, to the point where no more in-memory bitmap buffers were available and impacted foreground activity.

BCT for the Physical Standby was added in 11.1, as part of the Active Data Guard licensed option. There were some critical bugs with this feature that were fully resolved only in 11.2, although patches are available for 11.1.0.7.

I have seen BCT deployed successfully at customers with > 20MB/s (50 GB/hr), Oracle’s internal testing was ~50MB/s.

Bottom line, stability of the feature and not performance is the concern. On the primary: 10.2.0.4, 11.1.0.7, 11.2 are solid. On the standby: 11.1.0.7 (+ BCT patches), 11.2.0.1 (+ BCT patches), 11.2.0.2 are best.

* Vinay is currently an architect at Delphix and spent the past  12 years  at Oracle where he was involved in building High Availability services for the Oracle Database kernel. He was a member of early development teams that delivered RAC Cache Fusion, multi-threaded redo generation, and high performance replication. Since 2005 he was managing the High Availability infrastructure engineering team responsible for the development of the recoverability features of the core database and high-revenue products like Active Data Guard Standby, RAC, and Exadata. Prior to that Vinay also had stints in Oracle’s technical support and server performance organizations.

Uncategorized

Data Control

September 9th, 2013


Video of webinar given on Data Control by myself and James Murtagh of Red Gate




 

 

Businesses today face a difficult conundrum: industry giants such as Google, Facebook, and others have taught people to expect service that’s reliable and fast, yet budgets are getting tighter by the fiscal year. Instead of being able to provide quality, on-time applications that are bug free and high performance, struggling IT departments are forced to cut back on systems and resources for developers.

The most vital resource for applications can also be one of the most cumbersome: the database. Data is the backbone to our most important applications, and as such it is stockpiled and maintained in a robust database engine—yet the contents can become quite large and unruly. This makes data difficult to manage, particularly when mobility is required. A lack of data mobility can cause severe issues in the lifecycle of a product as it impacts good QA and development.

This problem has become so widespread in fact that development teams have turned to new measures in order to complete their work despite bottlenecks in data management. These measures are collectively summed up as the DevOps movement.

From Wikipedia: DevOps (a portmanteau of development and operations) is a software development method that stresses communication, collaboration and integration between software developers and information technology (IT) professionals. DevOps is a response to the interdependence of software development and IT operations. It aims to help an organization rapidly produce software products and services.

One of the most important aspects of DevOps is release management. Standardized development environments are becoming more common and tools like git and SVN are almost universally usable to ensure a quality release and deployment process. This facilitates the goals of fast development, quality assurance, and rapid feature/maintenance release schedules. There’s only one problem with this: the database doesn’t play along. Yet.

Databases have no release-based data control like developers have for code. With code it is easy to have a trunk with branches, merge in changes, and preserve integrity with a check-out and check-in process. In fact, most databases used for development purposes are not only woefully behind production in terms of data freshness, but also are used for development across multiple releases without any possibility of identifying how the data looked or should look at the time of release. The database is just there—a giant glob of data that doesn’t follow the same rules as the code and often gets in the way of an otherwise efficient process.

In the database world, it’s common to hear the idiom “Cheap, fast, or good – pick two.” Some might even say you can only pick one. In all, the outlook is very bleak.

Here is what we at Delphix are seeing in the industry:

  • Inefficient QA ends up costing companies thousands or millions as bugs make their way into production.
  • QA delays mean extra work for project managers, QA teams, and IT teams as people scramble to find and fix issues and manage project deadlines.
  • Developers and QA teams sharing databases causes bottlenecks that result in difficult coding and testing practices and slower time to delivery.
  • Creating development or QA databases from subsets of production data (usually due to size) results in bugs due to impractical data volumes.
  • Slow provisioning and infrastructure rollout means delays that impact an application from the very foundation on up and result in massive schedule overruns.

Let’s look at these issues in a bit more detail.

 

 1. Inefficient QA

Screen Shot 2013-09-05 at 5.26.27 PM

One Delphix customer we were talking to spent 94% of every QA cycle building the QA environment and only 6% actually running the QA test suite. This cycle was repeated every single time a set of features required testing. The long pole in the environment build was provisioning the database the code had to run on. The burden was so overwhelming it completely monopolized their allotted time.

 

2. QA Delays make fixing bugs more expensive

Screen Shot 2013-09-05 at 5.38.40 PM

When building a QA environment takes a long time, the integrity of the application suffers. In the above case, if only 6% of QA time is spent on actual testing it is very easy for bugs to go unnoticed. When this happens there can only be two possible outcomes: either the bug finally does get caught in QA but, due to the delay in catching it, more code has to be rolled back and retested to fix it, or the bug doesn’t get noticed and makes its way into production, where it can cost customer confidence or large sums of revenue. The longer it takes for a problem to be found and fixed, the more the cost goes up.

Screen Shot 2013-09-05 at 5.55.14 PM

 

3. Using subset databases causes bugs

Screen Shot 2013-09-05 at 5.27.23 PM

 

Screen Shot 2013-09-05 at 5.27.31 PM

For most companies it is impossible to give all developers their own full-size copy of the production database (at some companies, they can’t even provide one full-size copy). As such, they create subset copies of production for development and QA purposes. Writing and maintaining the scripts that create data subsets and verify data coherency and referential integrity takes a huge amount of time; in fact, one customer of ours estimates that 50% of their DBA time is dedicated to creating subset database clones. And because the data is not representative of production data, it often leads to the introduction of bugs. When the QA environment is similarly crippled, it may be impossible to isolate and verify a fix to the bug once it is caught. And with a subset, there will be bugs: another customer of ours estimated that 20% of their production bugs were due to using subsets.

 

Queries that run fine on subsets either run into new anomalies on production data, or hit the wall as performance of a query tanks when run on a full dataset size.

 

4. Development teams sharing on full copy leads to bottlenecks

Screen Shot 2013-09-05 at 5.27.42 PM

An alternative to subsets (or sadly, sometimes used in combination) is having the development team share a single copy of the production database. Sharing a full copy requires that any changes to the database schema, content, or metadata pass through code review to ensure that the changes won’t break the work of other developers. Instead of developers being able to perform their own work quickly and efficiently and merge it into the final release, they are forced to work in a synchronous manner that wastes time. A large online auction site we talked to said this review cycle took one to two weeks before a developer could get changes into a shared database copy. In today’s fast paced world that is simply unacceptable.

Because so many people are using the shared resource and because that resource is so big and time consuming to build, it can be a very long time between refreshes of the data. This leads to old unrepresentative data, which incurs its own set of risks every bit as dangerous as data subsets.

Screen Shot 2013-09-05 at 5.58.58 PMScreen Shot 2013-09-05 at 5.59.10 PM

 

 5. Slow environment builds and the culture of “NO”

Screen Shot 2013-09-05 at 5.27.49 PM

A huge IT hurdle to tackle is creating development environments in the first place. Data is growing faster than our infrastructures are improving, and the size of our databases means that cloning or refreshing environments can take days or even weeks to accomplish. Because development environments are incredibly dependent on real-time data, this results in lost productivity and reliability. One large global bank told us that it generally took 6 months to build their environments out. Another customer who used NetApp storage snapshots said it still took them 2-4 days due to their system of entering tickets for requests and then trying to schedule time and resources from the storage admin, system admin, and DBA teams. All of this hassle leads to a culture of “NO”, where developers and other front-line product creators are simply told that meeting their needs is not possible. Sadly, many developers are used to the idea that if they need a copy of the data they will be told “NO” in most cases.

 

Can I get a copy? No!  Result: Developer motivation


Data is the Problem

 

Getting the right data to the right people at the right time and in the right form is the problem. The problem is data agility—or more to the point, a lack thereof.

A lot of people think they understand the extent of their data problems. But let’s go through what we’ve learned from our customers in the industry:

  • 96% of QA time is spent on building environments instead of running QA tests
  • 90% of lost development days attributed to waiting for development environments
  • 50% of DBA time spent making database copies
  • 20% of bugs because of using subset data instead of full copies

Budget cuts put pressure on IT teams which limits system resources, storage, and personnel. These limits reduce the number of systems available and increase the time to provision these resources. These limits impact developers and contractors that work on these systems leading to project delays and reduced quality.

 

The Total Cost of Data

Data is expensive. In most calculations, the total cost of data only includes storage costs; however, storage costs are just the tip of the iceberg. For every GB or TB of storage there are teams of storage admins, system admins, and DBAs who have to manage the storage, move data, backup data, and provision clones for development and other needs. The more data there is to move, the longer it takes and the more expensive it is due to the personnel involved, schedule overruns, and time spent discussing the requirements and tasks in meetings. The total cost of data goes far beyond simple storage costs.

 

Delphix Solves the Data Problem

 

I’ve written extensively on Delphix elsewhere. You can see some of the more pertinent cases here:

In a nutshell, Delphix makes creating clone copies so fast that it is nearly unimaginable to even the most modern of IT teams. These clones can be provisioned in only a few minutes no matter what the source database size and require almost no storage overhead because duplicate blocks are shared across all clone copies. Because the clones are fast and almost free they can be provisioned for all the developers; in fact, a developer can easily be given multiple copies of the full production database. The clones can also be made fresh, just a few minutes behind production, in a matter of minutes.

When this technology was explained to one company we work with, the overwhelming response was “No way.” When the first clone of a 7TB database was created in 7 minutes and a second clone (from a time period in the past, no less) was created in an additional 10 minutes the response was, “No freaking way.” The implications of having two full-sized fresh clones to any point in time in a matter of minutes with no additional storage overhead were almost impossible to comprehend.

 

 

Database Clone Copies with Delphix: Full, Fresh, Fast, Free

Screen Shot 2013-09-05 at 9.19.25 PM

 

What the Industry can accomplish with Delphix

 

  1. Efficient QA: Low cost, high utilization
  2. Quick QA: Fast bug fixes and problem resolution
  3. Developer Agility: Everyone gets a database for parallelized development
  4. Full DBs: Fewer bugs in development, QA, and production
  5. Fast builds: Fast development and the culture of Yes

 

1. Efficient QA

Screen Shot 2013-09-05 at 8.55.02 PM

With Delphix, QA environments can be spun up in minutes and branched directly from the development database.

 

2. Quick QA

 

Screen Shot 2013-09-05 at 8.55.10 PM

Because QA environments can be spun up quickly, QA can test immediately at the end of a sprint to find bugs quicker and reduce the time and work needed to fix those bugs.

 

3. Developer Agility

 

Screen Shot 2013-09-05 at 8.55.31 PM

With Delphix, each developer can have one or more full copies of a database and can refresh with new data in minutes. Developers will no longer be blockers to each other because of a single point of wait.

 

4. Full DBs

 

Screen Shot 2013-09-05 at 8.55.38 PM

With Delphix, the development databases are full copies so there are no longer bugs due to small and inadequate representations of real data.

 

5. Fast Builds

 

Screen Shot 2013-09-05 at 9.27.17 PM

With Delphix, developers can provision their own clone copies in minutes without having to coordinate with storage admins, system admins and DBAs.

 

Dev Ops Management with Delphix

 

A big question is: If each developer has their own database, how do we manage the merging of their changes into a release? This is where Red Gate Schema Compare comes into play. For example, consider this process:

  1. Delphix: Create 3 Virtual Databases
    1. DEV1
    2. DEV2
    3. TRUNK
  2. Have two developers perform their work
    1. Developer 1 works on DEV1
    2. Developer 2 works on DEV2
  3. Red Gate Schema Compare performs merges
    1. DEV1 changes move into TRUNK
    2. DEV2 changes move into TRUNK

 

This process can be easily summed up in a visual flow:

 

 Screen Shot 2013-09-06 at 2.44.24 PM

 

From a timeline point of view, the process is as simple as it is agile:

Screen Shot 2013-09-05 at 9.32.36 PM

 

Quality Assurance can easily be a part of this process. The steps are more robust, but with near instant cloning it becomes trivial while remaining thorough:

Screen Shot 2013-09-05 at 9.32.42 PM

A few last words about Developers, DBAs and DevOps

 

 

 

Uncategorized

Finding Oracle Home

September 6th, 2013



Once in a while I work on systems that are new to me with no one around to explain the configuration. I can see the databases running and their SIDs via the process names, but when there are multiple possible ORACLE_HOME directories it can be difficult to even connect to these databases, because connecting locally requires both the SID and the ORACLE_HOME. In these cases I’ve often wanted a way to clearly figure out which running database is using which ORACLE_HOME, so it was fun to see Tanel Poder’s recent post on how to do this on Linux and Solaris. Here is a quick summary for Linux and Solaris, plus HP-UX and AIX as well:

Solaris and Linux

$ pgrep  -lf _pmon_
12586 ora_pmon_ee10204

ORACLE_SID is ee10204

$ pwdx 12586
12586: /u01/oracle/10.2.0.4/ee1/dbs

ORACLE_HOME is /u01/oracle/10.2.0.4/ee1

HP-UX

$ ps -ef | grep pmon
ora1024 25611     1  0  Mar 21  ?         0:24 ora_pmon_itanic10

ORACLE_SID is itanic10

$ pfiles 25611 | grep  bin
25611:                  /opt/ora1024/app/db_1/bin/oracle

ORACLE_HOME is /opt/ora1024/app/db_1

AIX

$ ps -ef | grep pmon
ora1024   262314        1   0   Mar 23      -  0:12 ora_pmon_mercury

ORACLE_SID is mercury

$ ls -l /proc/262314/cwd
lr-x------   2 ora1024  dba  0 Mar 23 19:31 cwd -> /data/opt/app/product/10.2.0.4/db_1/dbs/

ORACLE_HOME is /data/opt/app/product/10.2.0.4/db_1


Uncategorized

Disk file operations I/O

September 5th, 2013



The database is getting high waits on buffer busy waits. Here is an example period where 5 sessions are completely blocked on buffer busy waits for 4 minutes:

  select
       to_char(min(sample_time),'HH24:MI') minst,
       to_char(max(sample_time),'HH24:MI') maxst,
       count(*),
       session_id,
       ash.p1,
       ash.p2,
       ash.SQL_ID,
       blocking_session bsid
  from DBA_HIST_ACTIVE_SESS_HISTORY ash
  where event='buffer busy waits'
     and session_state='WAITING'
group by
       session_id
       ,sql_id
       ,blocking_session
       ,ash.p1
       ,ash.p2
Order by  1
/

gives

MINST MAXST   COUNT(*) SESSION_ID         P1         P2 SQL_ID           BSID
----- ----- ---------- ---------- ---------- ---------- ------------- -------
21:54 21:58         26       1526       1812     278732 3gbsbw6w8jdb3      11
21:54 21:58         25        528       1812     278732 3gbsbw6w8jdb3      11
21:54 21:58         25       1514       1812     278732 3gbsbw6w8jdb3      11
21:54 21:58         26        777       1812     278732 3gbsbw6w8jdb3      11
21:54 21:58         25         33       1812     278732 3gbsbw6w8jdb3      11

All are waiting on the same file and block, held by one session, and all are executing the same statement, which is a select for update.

The blocker is executing the same SQL statement and spends those 4 minutes waiting for “Disk file operations I/O”:

select
       min(to_char(sample_time,'HH24:MI')) minst,
       max(to_char(sample_time,'HH24:MI')) maxst,
       session_id,
       substr(event,0,30) event,
       count(*),
       ash.SQL_ID,
       blocking_session bsid
from DBA_HIST_ACTIVE_SESS_HISTORY ash
where
       session_id in  ( 11)
group by
     event,sql_id,session_id, blocking_session
order by 1
/

gives

MINST MAXST EVENT                            COUNT(*) SQL_ID           BSID
----- ----- ------------------------------ ---------- ------------- -------
21:54 21:58 Disk file operations I/O               26 3gbsbw6w8jdb3

What are “Disk file operations I/O”?

From the docs , http://docs.oracle.com/cd/E18283_01/server.112/e17110/waitevents003.htm#insertedID40

Disk file operations I/O

This event is used to wait for disk file operations (for example, open, close, seek, and resize). It is also used for miscellaneous I/O operations such as block dumps and password file accesses.

  • FileOperation – type of file operation
  • fileno – file identification number
  • filetype – type of file (for example, log file, data file, and so on)

What kind of FileOperations and filetypes are occurring:
select p1,p3, count(*) from
dba_hist_active_sess_history
where event ='Disk file operations I/O'
group by p1,p3
/
        P1         P3   COUNT(*)
---------- ---------- ----------
         2          1        193
         2          3         14
         4          4          1
         2          2       4459
         3          4        160
         1         18        103
So mainly FileOperation type 2 and filetype 2.
What are the file types? Not sure, but thanks to Andy Klock on Oracle-L  this looks like a possibility:
select distinct filetype_id, filetype_name from DBA_HIST_IOSTAT_FILETYPE order by 1;

FILETYPE_ID FILETYPE_NAME
----------- ------------------------------
0 Other
1 Control File
2 Data File
3 Log File
4 Archive Log
6 Temp File
9 Data File Backup
10 Data File Incremental Backup
11 Archive Log Backup
12 Data File Copy
17 Flashback Log
18 Data Pump Dump File

What about FileOperation=2? After a call to Oracle support, it looks like on this version, 11.2.0.3 the values are:

    1 file creation
    2 file open
    3 file resize
    4 file deletion
    5 file close
    6 wait for all aio requests to finish
    7 write verification
    8 wait for miscellaneous io (ftp, block dump, passwd file)
    9 read from snapshot files

putting this together gives a script like:

col file_type for a20
col file_operation for a20
select
    decode(p3,0 ,'Other',
              1 ,'Control File',
              2 ,'Data File',
              3 ,'Log File',
              4 ,'Archive Log',
              6 ,'Temp File',
              9 ,'Data File Backup',
              10,'Data File Incremental Backup',
              11,'Archive Log Backup',
              12,'Data File Copy',
              17,'Flashback Log',
              18,'Data Pump Dump File',
                   'unknown '||p3)  file_type,
    decode(p1,1 ,'file creation',
              2 ,'file open',
              3 ,'file resize',
              4 ,'file deletion',
              5 ,'file close',
              6 ,'wait for all aio requests to finish',
              7 ,'write verification',
              8 ,'wait for miscellaneous io (ftp, block dump, passwd file)',
              9 ,'read from snapshot files',
                  'unknown '||p1) file_operation,
    decode(p3,2,-1,p2) file#,
    count(*)
from dba_hist_active_sess_history
where event ='Disk file operations I/O'
group by p1,p3,
    decode(p3,2,-1,p2)
/

with output like

FILE_TYPE            FILE_OPERATION            FILE#   COUNT(*)
-------------------- -------------------- ---------- ----------
Control File         file open                     0        193
Data File            file open                    -1       4460
Archive Log          file deletion                 0          1
Log File             file open                     0         14
Data Pump Dump File  file creation                 0        103
Archive Log          file resize                   8        160

The “-1” for datafiles is there to group all the datafiles into one line; otherwise, in the above case, there would have been over 200 lines of output.
For the datafiles, what do the I/O latencies look like compared to the Disk file operations I/O?
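The query that produced the table below isn’t shown; a sketch that yields similar half-hourly numbers from the standard AWR views (DBA_HIST_SYSTEM_EVENT and DBA_HIST_SNAPSHOT), by differencing the cumulative counters between snapshots, looks like this:

select to_char(end_interval_time,'HH24MI') bhou,
       event_name,
       round((time_waited_micro - prev_time)/nullif(total_waits - prev_waits,0)/1000,2) avg_ms,
       total_waits - prev_waits ct
  from (select sn.end_interval_time, sn.snap_id, e.event_name,
               e.time_waited_micro, e.total_waits,
               lag(e.time_waited_micro) over (partition by e.event_name order by sn.snap_id) prev_time,
               lag(e.total_waits)       over (partition by e.event_name order by sn.snap_id) prev_waits
          from dba_hist_system_event e, dba_hist_snapshot sn
         where e.snap_id = sn.snap_id
           and e.dbid = sn.dbid
           and e.instance_number = sn.instance_number
           and e.event_name in ('Disk file operations I/O',
                                'db file scattered read',
                                'db file sequential read'))
 where prev_waits is not null
 order by event_name, snap_id;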

BHOU EVENT_NAME                        AVG_MS           CT
---- ------------------------------ --------- ------------
1054 Disk file operations I/O            2.00       13,547
1130 Disk file operations I/O            1.52       10,658
1200 Disk file operations I/O            1.57        9,846
1230 Disk file operations I/O            2.45        8,704
1300 Disk file operations I/O            3.84        9,526
1330 Disk file operations I/O            2.39       11,989
1400 Disk file operations I/O            1.68       14,698
1430 Disk file operations I/O            2.89       14,863
1500 Disk file operations I/O          860.85       10,577
1530 Disk file operations I/O           12.97       11,783
1600 Disk file operations I/O          623.88       10,902
1630 Disk file operations I/O          357.75       12,428
1700 Disk file operations I/O          294.84       10,543
1730 Disk file operations I/O           12.97       10,623
1800 Disk file operations I/O          461.91       14,443
1830 Disk file operations I/O           12.83       18,504
1900 Disk file operations I/O          443.37        9,563
1930 Disk file operations I/O          237.39       11,737
2000 Disk file operations I/O          542.44       13,027
2033 Disk file operations I/O            6.11        8,389
2100 Disk file operations I/O           16.85       10,561
2130 Disk file operations I/O          306.17        9,873
2200 Disk file operations I/O           20.83       11,335
2230 Disk file operations I/O           12.92       10,158
2300 Disk file operations I/O           13.42       11,025
2330 Disk file operations I/O           15.01       10,883
0000 Disk file operations I/O            5.33        8,533
1054 db file scattered read              1.50       92,394
1130 db file scattered read              1.33       73,243
1200 db file scattered read              1.82      122,988
1230 db file scattered read              2.53      255,474
1300 db file scattered read              4.26      288,144
1330 db file scattered read              2.47      308,045
1400 db file scattered read              2.60       91,684
1430 db file scattered read              3.56      176,324
1500 db file scattered read              4.95      621,658
1530 db file scattered read              5.11      227,565
1600 db file scattered read              5.86      472,804
1630 db file scattered read              9.44      224,984
1700 db file scattered read              9.40      165,238
1730 db file scattered read              7.78      349,003
1800 db file scattered read              6.93      252,761
1830 db file scattered read              7.79      151,760
1900 db file scattered read              5.48      165,369
1930 db file scattered read              3.09      200,868
2000 db file scattered read              3.45      136,647
2033 db file scattered read              5.17      136,330
2100 db file scattered read             11.16      103,799
2130 db file scattered read             10.44      118,025
2200 db file scattered read             20.02      127,638
2230 db file scattered read             13.66      157,210
2300 db file scattered read             10.95       98,493
2330 db file scattered read              8.39      149,606
0000 db file scattered read              4.16      230,075
1054 db file sequential read             3.04    1,152,102
1130 db file sequential read             7.75      165,262
1200 db file sequential read             6.74       23,876
1230 db file sequential read             5.30       10,026
1300 db file sequential read             3.34      496,681
1330 db file sequential read             1.58    1,253,208
1400 db file sequential read             8.86      239,247
1430 db file sequential read            12.91      191,376
1500 db file sequential read            19.97       73,061
1530 db file sequential read            17.80       43,662
1600 db file sequential read            12.41      144,741
1630 db file sequential read             8.99      411,254
1700 db file sequential read             8.03      540,138
1730 db file sequential read             9.26      422,317
1800 db file sequential read            19.16      155,787
1830 db file sequential read             6.01      641,517
1900 db file sequential read             4.79      573,674
1930 db file sequential read             2.72      824,991
2000 db file sequential read             1.59      504,650
2033 db file sequential read             1.88      324,741
2100 db file sequential read            24.32       74,026
2130 db file sequential read            16.05       67,545
2200 db file sequential read            15.52      219,928
2230 db file sequential read             9.87      259,956
2300 db file sequential read            15.18      122,362
2330 db file sequential read             9.97       94,124
0000 db file sequential read            14.19       50,264

It’s tough to see what is going on, so let’s lay it out in a bubble chart. Latency in ms is on the Y-axis, time is on the X-axis (AWR snapshots every half hour), and the size of the ball is the number of requests:

The Disk file operations I/O latency doesn’t seem to correlate with disk activity or with disk latency. When disk activity is high, on the left, with much data coming off the spindle (i.e. an 8ms average), Disk file operations I/O is fast, i.e. 2ms. But at 20:00 there is medium I/O activity and much of it is coming from cache (a 2ms average), yet the Disk file operations are slow.

The database is not using ASM. The number of datafiles is 900. The open file descriptor limit is set at 8192.

Might be time to run some DTrace on the file operations to get a better idea of what is going on.

UPDATE:

Full disclosure: this database was running over NFS on the Delphix OS, which supports DTrace. I ran DTrace on all the NFS operations and none took over 2 seconds, which is far from the 4 minutes reported by the database.

I ended up writing a script to run pstack on sessions that had been waiting on Disk file operations I/O for more than 10 seconds. When the first pstack was taken there is no guarantee the process was still in the same wait, but if a second pstack was taken for the same process, then it is guaranteed that the first pstack was taken during the wait. All such waiters that had 2 pstacks were waiting in the first pstack for

_close_sys()

So, although Oracle looks to be reporting slow file open operations, the pstacks are showing problems on close file.
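The pstack driver script itself isn’t shown here, but picking its targets is straightforward; a sketch of how the long waiters and their OS PIDs could be found before handing them to pstack:

-- sessions that have been in the wait for more than 10 seconds, with their server process PIDs
select s.sid, p.spid, s.seconds_in_wait
  from v$session s, v$process p
 where s.paddr = p.addr
   and s.event = 'Disk file operations I/O'
   and s.state = 'WAITING'
   and s.seconds_in_wait > 10;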

Googling for this on HP-UX (the database was running on HP-UX), I found a couple of links, but nothing definitive:

http://www.unix.com/hp-ux/30396-socket-close-hangs-cpu-go-100-a.html
http://h30499.www3.hp.com/t5/Networking/CPU-usage-is-100-when-close-sys-call-is-made/td-p/4609196

Oracle reports a few bugs with the wait event “Disk file operations I/O” but none seem applicable to this install:

Bug 10361651 – File open may hang/spin forever if dNFS is unresponsive [ID 10361651.8]

This bug is supposed to be fixed in 11.2.0.3 and applies to dNFS, though the issue sounds like it would fit: if an I/O operation takes “long” then Oracle’s wait times get corrupted and the process waits much, much too long on the event.

 

Uncategorized

EM 12c Snap Clone

September 4th, 2013



Oracle OEM 12c introduces a new feature that enables the creation of Oracle database thin clones by leveraging file system snapshot technologies from either ZFS or NetApp. OEM adds a graphical interface to the process of making database thin clones. The feature that enables database thin cloning in OEM is called Snap Clone and is part of OEM Cloud Control’s self service for data cloning. Snap Clone is available via the Database as a Service (DBaaS) feature and leverages the copy-on-write technologies available in some storage systems. Support is initially available for NAS storage, specifically ZFS Storage and NetApp storage.

In order to use Snap Clone, one has to install the source database such that its data files are on ZFS Storage or a NetApp array and have the storage managed by agents on a Linux machine; one can then thin clone data files on that same storage array.

Snap Clone offers role based access, so storage admins can log in and see only the areas they are responsible for, and end users’ access to source databases, clones and resources can be limited as well.

Setting Snap Clone

 

The prerequisites for getting started with Snap Clone are having available storage on a ZFS Storage Appliance or NetApp storage array as well as having access to a master test database. A master test database is a database that holds a sanitized version of a production database, i.e. it is either a subset and/or masked. The test master database has to be registered with OEM. After the test master is registered with OEM, Snap Clone can be set up. To set up Snap Clone, come into Oracle Cloud Control 12c as the “cloud administrator” role with the “storage administrator” privilege, or as super administrator, and register the storage. To register the storage navigate to “Setup -> Provisioning and Patching -> Storage Registration”.

  • Navigate to “Setup -> Provisioning and Patching -> Storage Registration”
  • Click the “Register” tab and choose the storage, either NetApp or ZFS
    • Supply storage information
      • Name: Storage array name registered in DNS
      • Vendor
      • Protocol: http or https
      • Storage Credentials: credentials for interacting with storage
  • Install agents on a separate Linux machine to manage the NetApp or ZFS storage. An agent has to run on a Linux host to manage the storage. Supply the
    • Agent host
    • Host credentials
  • Pick a database to make the test master
    • Put the test master on ZFS storage or Netapp storage
    • Register the ZFS storage or Netapp storage with OEM
    • Enable Snap Clone for the  test master database
  • Set up a zone – set max CPU and Memory for a set of hosts and the roles that can see these zones
  • Set up a pool – a pool is a set of machines where databases can be provisioned
  • Set up a profile – a source database that can be used for thin cloning
  • Set up a service template – reference values such as a init.ora for database to be created

Screen Shot 2013-06-09 at 10.08.47 PM

Figure 1. The entry page in OEM 12c Cloud Control. From here go to the top right and choose Setup, then Provisioning and Patching, then Storage Registration, as shown above.

 Navigate to storage registration

To set up Snap Clone, navigate to storage registration by choosing the menus “Setup -> Provisioning and Patching -> Storage Registration”.

Screen Shot 2013-06-09 at 10.10.44 PM

Figure 2 Shows a zoom into the menus to choose

Screen Shot 2013-05-31 at 10.09.02 PM

Figure 3. Storage Registration Page

Screen Shot 2013-06-09 at 10.15.04 PM

Figure 4. Choose the type of storage array

Once on the Storage Registration page, choose “Register” and then choose the storage, either Netapp or Sun ZFS.

Register the Storage

Screen Shot 2013-06-09 at 10.17.02 PM

Figure 5. Storage Registration Page

To register the storage supply the following information

  • Name: Storage array name registered in DNS
  • Vendor
  • Protocol: http or https
  • Storage Credentials: credentials for interacting with storage

All of which is documented in cloud administration guide.

Define agents used to manage storage

Then define the agents used to manage the storage. Agents are required to run on a Linux host. More than one agent can be defined to provide redundancy. The agents will be the path by which OEM communicates with the storage. For each agent, supply the following information:

  • Host name
  • Credential type
  • Credentials

And finally define the frequency with which the agent synchronizes with the storage to gather the storage hardware details, such as information on aggregates, shares and volumes.

After the storage information, agent information and agent synchronization information have been filled out, hit the “Submit” button in the top right. Hitting the submit button will return the UI to the “Storage Registration” page. There, click on the storage appliance listed in the top, then click on the Contents tab in the bottom half of the page. This will list all the volumes and aggregates in the storage appliance.

Looking at volumes on the Storage Array

Screen Shot 2013-06-09 at 10.18.48 PM

Figure 6. Editing storage ceilings by clicking on an aggregate and hitting the “Edit Storage Config” tab.

For each aggregate one can set storage ceilings. Click on the aggregate or FlexVol and then click the “Edit Storage Ceilings” tab.

Choosing a Database Test Master

On the database tab is a list of databases that can be used for cloning. OEM detects the databases automatically on the hosts it is managing. OEM will also automatically correlate databases that have storage on the registered storage array; it looks for all databases that have files on the registered storage. Click on a database, then on the “Show Files” tab, which will show the files and volumes for this database.

Screen Shot 2013-06-09 at 10.20.00 PM

Figure 7. List of files by volume for a database

Screen Shot 2013-06-09 at 10.20.07 PM

Figure 8. Enable Snap Clone for databases that will be used as test masters.

Nominating a database as a test master requires enabling Snap Clone for it. To enable Snap Clone for a database, click on the chosen database, then click the “Enable Snap Clone” tab just above the list of databases. This will automatically validate that all the volumes are FlexClone enabled (in the case of NetApp).

 

Setting up Zones

The next step is to configure zones, which can be used to organize cloud resources.

Choose the menu option “Enterprise -> Cloud -> Middleware and Database Home”

 

Screen Shot 2013-06-09 at 10.23.11 PM

Figure 9. Navigate first to “Cloud -> Middleware and Database Home”

Middleware and Database Cloud page

Screen Shot 2013-06-09 at 10.24.23 PM

Figure 10. Middleware and Database Cloud page

Setting up a Zone

In order to see the zones defined, click on the number next to the title “PaaS Infrastructure Zones” in the top left under General Information.

Screen Shot 2013-06-09 at 10.26.24 PM

Figure 11. PaaS Infrastructure Zones

To create a zone, click the tab “Create”.

Screen Shot 2013-06-09 at 10.27.33 PM

Figure 12. First page of the wizard to create a PaaS Infrastructure Zone; give a meaningful name and description for the zone and define the maximum CPU utilization and memory allocation.

In the first page of the “PaaS Infrastructure Zone”, give zones a meaningful name and description. Define constraints such as maximum host CPU and memory allocations.

Screen Shot 2013-06-09 at 10.28.56 PM

Figure 13. Second page of the “PaaS Infrastructure Zone” wizard, add hosts that are available to the zone.

Next define hosts that are members of the zone and provide credentials that operate across all members of this zone

Screen Shot 2013-06-09 at 10.30.11 PM

Figure 14. Third page of the “PaaS Infrastructure Zone” wizard, limit which roles can see the zone.

Next define which roles can see and access this zone. The visibility of the zone can be limited to a certain class of users via roles like Dev, QA, etc.

Screen Shot 2013-06-09 at 10.31.18 PM

Figure 15. Final review page for “PaaS Infrastructure Zone” wizard

Finally review settings and click submit

Screen Shot 2013-06-09 at 10.32.29 PM

Figure 16. Confirmation that the PaaS Infrastructure Zone has been successfully created.

 

Creating Database Pool and Profiles

The remaining step required to enable Snap Clone is to create a database pool, which is a collection of servers or nodes that have database software installed. This part of the setup is done by a different user who is the administrator for Database as a Service.

Log in as  DBAAS_ADMIN.

For the next part navigate to the menu “Setup -> Cloud -> Database”.

Screen Shot 2013-06-09 at 10.33.24 PM

Figure 17. Middleware and Database Cloud page

Screen Shot 2013-06-09 at 10.35.59 PM

Figure 18. Navigate to “Setup -> Cloud -> Database”.

Screen Shot 2013-06-09 at 10.36.56 PM

Figure 19. Database Cloud Self Service Portal Setup. To create a database pool choose the “Create” button in the center of the page and from the pull down, choose “For Database”.

To create a new pool click on the “Create” button in the center of the page, and choose “For Database” from the pull down menu that appears.

Screen Shot 2013-06-09 at 10.38.16 PM

Figure 20. Choose “Create -> For Database”

Screen Shot 2013-06-09 at 10.39.10 PM

Figure 21. Edit Pool page. Provide a meaningful name and description of the pool. Add Oracle home directories in the bottom of the page. At the very bottom of the page set a constraint on the number of database instances that can be created in the pool. On the top right, set the host credentials.

Set

  • Name and description
  • Oracle Home
  • Maximum number of databases per host
  • Credentials

In the “Edit Pool” page, at the top left of the screen, provide a meaningful name and description for the pool. In the middle of the screen add the Oracle homes that will be used for database provisioning. Every member of a database pool is required to be homogeneous, meaning the platform and Oracle version are the same across all the hosts and Oracle homes in the pool. All the Oracle installations also have to be of the same type, either single instance or RAC. In the top right, for the Oracle home, provide Oracle credentials and root credentials. Finally, at the bottom of the page, a constraint can be set on the number of database instances that can be started in this pool.

Screen Shot 2013-06-09 at 10.41.03 PM

Figure 22. Set request limits on the pool

The next page sets the request settings. The first restriction sets how far in advance a request can be made. The second restricts how long a request can be kept, which is the archive retention; after the archive retention time the requests are deleted. Finally there is the request duration, which is the maximum duration for which a request can be made.

Screen Shot 2013-06-09 at 10.42.12 PM

Figure 23. Set memory  and storage quotas per role for the pool. The quotas cover memory, storage, database requests and schema requests.

 

The above page configures quotas. A quota is allocated to each and every self service user and controls the amount of resources the user has access to. Quotas are assigned to a role and users inherit quota values from the role. Click “Create” in the middle of the screen.

Screen Shot 2013-06-09 at 10.43.03 PM

Figure 24. Editing the quotas on a pool for a  role.

The popup dialogue has these entries:

  • Role name
  • Memory (GB)
  • Storage (GB)
  • Number of database requests

Screen Shot 2013-06-09 at 10.43.56 PM

Figure 25. Profiles and Service Templates

Profiles and service templates

A profile is used to capture information about the source database which can then be used for provisioning.

A service template is a standardized service definition for a database configuration that is offered to the self service users. A collection of service templates forms the service catalogue. A service template can provision a database with or without seed data. To capture an ideal configuration, the easiest thing to do is to point at an existing database and fetch the information of interest from that database. That information is captured using a profile.

To create a profile click on the “Create” button under “Profiles”.

Creating a profile

Screen Shot 2013-06-09 at 10.45.07 PM

Figure 26. Specify a reference target for a profile.

Click the magnifying glass to search for a source database.

Screen Shot 2013-06-09 at 10.46.46 PM

Figure 27. Search for a reference target database

 

Pick a reference target by clicking on it, then click the “Select” button in the bottom right.

Screen Shot 2013-06-09 at 10.47.31 PM

Figure 28. Creating a Database Provisioning Profile

To pick a database for use in thin cloning, check “Data Content”, with the sub-option “Structured Data” selected, then the sub-option “Create”, then the sub-option “Storage Snapshot”. This option is enabled only when the “enable snapshot” option was enabled on the storage registration page. Disable the option to capture the Oracle home.

Provide credentials for the host machine’s Oracle account and for the database login.

 

The “Content Option” step is not needed for the above selections.

Screen Shot 2013-06-09 at 10.48.16 PM

Figure 29. Give the profile a meaningful name and description.

 

Next provide credentials for the Oracle home and the Oracle database, then provide a meaningful name for the profile as well as a location. The profile will be useful when creating a service template.

Screen Shot 2013-06-09 at 10.49.12 PM

Figure 30. Review of create profile options.

 

Next review the summary and click Submit, which will connect to the storage and take snapshots of the storage.


Screen Shot 2013-06-09 at 10.50.46 PM

Figure 31. Zoom in on the menus to choose.

 

To create a new service template, first choose a profile; in this case use the “thin provisioning for reference DB” profile. Then click “Create” and choose “For Database”. Service templates are part of the service catalogue and are exposed to the self service users.

Screen Shot 2013-06-09 at 10.51.42 PM

Figure 32. Provide a meaningful name and description for the service template.

 

Provide a meaningful name and description. For the rest of the service template, provide information about the databases that will be created from the snapshots, such as the database type (RAC or single instance; for RAC provide the number of nodes). Provide the SID prefix to be appended to the SIDs generated for the clones, the Domain Name, and the port.

Screen Shot 2013-06-09 at 10.52.38 PM

Figure 33. Provide a storage area for writes to use.

 

The cloning operation only creates a read-only copy, so write space must be provided elsewhere in order to allow writing to the thin clone.

Click on the volume, then click the “Edit” button.

Screen Shot 2013-06-09 at 10.53.29 PM

Figure 34. Set the directory and maximum space usage for the write location.

 

Provide the mount point prefix and the amount of writeable space you wish to allocate.

 

Users of the thin clone databases can also be allowed to take further snapshots. These snapshots can be used as a mechanism to roll back changes. The number of these snapshots can be limited just below the storage size section:

Screen Shot 2013-06-09 at 10.54.18 PM

Figure 35. Set the number of snapshots that can be taken of a thin clone.

Screen Shot 2013-06-09 at 10.55.15 PM

Figure 36. Set the initial passwords for database accounts on the thin clone.

Next provide credentials for the administrative accounts:

  • SYS
  • SYSMAN
  • DBSNMP

For all other non-administrative accounts you can choose to leave the passwords as is or change them all to one password. You can also modify certain init.ora parameters, for example memory.

Screen Shot 2013-06-09 at 10.55.59 PM

Figure 37. Modify any specific init.ora parameters for the thin clone

 

Screen Shot 2013-06-09 at 10.57.13 PM

Figure 38. Set any pre and post provision scripts to be run at the creation of a thin clone.

 

Custom scripts can be provided as pre or post creation steps. This can be very useful if you want to register databases with OID or perform certain actions that are specific to your organization, as in the sketch below.
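
As a purely hypothetical sketch (the account name and grants below are made up for illustration and are not part of the product), a post-provision script might run something like this against the new clone:

-- hypothetical post-provision step: create a read-only diagnostic account on the clone
create user app_readonly identified by "Change_Me#1";
grant create session        to app_readonly;
grant select any dictionary to app_readonly;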

Screen Shot 2013-06-09 at 10.58.05 PM

Figure 39. Set the zone and pool for the thin clone

Screen Shot 2013-06-09 at 10.59.07 PM

Figure 40. Set the roles that can use the service template.

Associate this service template with a zone and a pool. This ensures that the service template can actually work on the group of resources that you have identified, and the visibility of the service template can be limited using roles.

Screen Shot 2013-06-09 at 11.00.17 PM

Figure 41. Review page for the service template creation.

Finally, review the summary and click Submit.

Creating  a Thin Clone

Screen Shot 2013-06-09 at 11.06.14 PM

Figure 42. 12c Cloud Control

Screen Shot 2013-06-09 at 11.07.06 PM

Figure 43. 12c Cloud Control Self Service Portal

Contents of the Database Cloud Self Service Portal screen:

  • Left hand side
    •  Notification – any instances that are about to expire
    •   Usage
      •     databases (number provisioned out of maximum)
      •     schema services
      •     Memory
      •     Storage
  • Right side
    •   Top
      •     Databases
    • Bottom
      •     requests – requests that created the database services and the database instances

Screen Shot 2013-06-09 at 11.08.02 PM

Figure 44. To clone a database, choose the “Request” then “Database” menu.

Screen Shot 2013-06-09 at 11.08.40 PM

Figure 45. From the list choose a Self Service Template. In this case “SOEDB Service Template”

Options are:

  •   RMAN backups, which are full clones
  •   empty databases
  •   snap clones, which are thin clones

Screen Shot 2013-06-09 at 11.09.42 PM

Figure 46. Fill out the clone Service Request

The request wizard asks for:

  • a request name
  • a zone – a collection of servers
  • a start and end time
  • a user name and password for a new database user

Users do not get system access to the databases but instead get a slightly less privileged user who becomes the owner of the database.

Hit Submit

Screen Shot 2013-06-09 at 11.10.25 PM

Figure 47. The newly created clone database.

 

References

http://www.youtube.com/watch?v=J7fnfLS5Dxg&feature=youtu.be – setup

http://www.youtube.com/watch?v=9VK1z6nU1PU – provisioning

http://www.oracle.com/technetwork/oem/cloud-mgmt/dbaas-snap-netapp-2041775.mp4

http://www.slideshare.net/kellynpotvin/dbaas-database-as-a-service-in-a-dbas-world#!

 

cloning, em, Oracle , ,

Metrics vs Statistics

September 4th, 2013



Here are the tuning metrics tables (SQL stats are not in “metric” tables per se).

(*DBA_HIST_…_HISTORY views are sort of confusing. As far as I remember they were storing alert history, but apparently they are used for adaptive thresholds – an area for future investigation.)


I’ve noticed a number of people posting queries using DBA_HIST_SYSSTAT instead of DBA_HIST_SYSMETRIC_SUMMARY which leads me to believe that there is some confusion or lack of information on the metric tables.

Oracle 10g introduced metric tables which compute deltas and rates of statistics thus hugely simplifying the ability to answer simple questions like “what is the I/O rate on my databases right now.” This question, before 10g, was surprisingly tedious to answer. To answer the question one would have to query v$sysstat  for example:

Select value from v$sysstat where name='physical reads';

but querying v$sysstat just once fails to answer the question but instead answers the question “How much I/O has been done since the database was started”. To answer the original question one would have to query v$sysstat twice and take the delta between the two values:

  • Take value at time A
  • Take value at time B
  • Delta = (B-A)
  • and/or get Rate = (B-A)/elapsed time
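
A minimal sketch of doing that by hand in SQL*Plus (this assumes execute privilege on DBMS_LOCK for the sleep; the 60 seconds is just an arbitrary sampling interval):

set serveroutput on

declare
  v_beg  number;
  v_end  number;
  v_secs constant number := 60;   -- arbitrary sampling interval in seconds
begin
  select value into v_beg from v$sysstat where name = 'physical reads';
  dbms_lock.sleep(v_secs);        -- requires execute on DBMS_LOCK
  select value into v_end from v$sysstat where name = 'physical reads';
  dbms_output.put_line('delta        : ' || (v_end - v_beg));
  dbms_output.put_line('rate per sec : ' || round((v_end - v_beg) / v_secs, 2));
end;
/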

Getting these deltas and rates could be a pesky task, especially when working with a customer over the phone. Then in 10g Oracle introduced metric tables which answer the question in one single query using

V$SYSMETRIC

such as

Select  VALUE , METRIC_UNIT,INTSIZE_CSEC
from v$sysmetric
where metric_name='Physical Reads Per Sec';
VALUE METRIC_UNIT           INTSIZE_CSEC
---------- ----------------- ------------
654.6736 Reads Per Second          5959
134.9835 Reads Per Second          1515

Notice that the query returns 2 rows. The first row is the last minute (ie 59.59 seconds) and the second row is the last 15 seconds (ie 15.15 seconds). Oracle collects both the deltas and rates for 60 second and 15 second intervals.
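
If only the 60 second delta is wanted, the row can be filtered on GROUP_ID (a quick sketch; on my system GROUP_ID 2 is the 60 second group and the 15 second rows carry a different GROUP_ID):

Select value, metric_unit, intsize_csec
from   v$sysmetric
where  metric_name = 'Physical Reads Per Sec'
  and  group_id = 2   -- 60 second deltas only
;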

Oracle has the average, maximum, minimum for the values for the last hour in

V$SYSMETRIC_SUMMARY

that one can query like:

select MAXVAL,MINVAL,AVERAGE,STANDARD_DEVIATION
from V$SYSMETRIC_SUMMARY
where metric_name='Physical Reads Per Sec';

MAXVAL     MINVAL    AVERAGE      STANDARD_DEVIATION
---------- ---------- ---------- ------------------
3.71784232          0 .076930034         .478529283

Also for the last hour Oracle stores the 60 second intervals and for the last 3 minutes the 15 second intervals in

V$SYSMETRIC_HISTORY
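
For example, a sketch of pulling the one minute samples for the last hour out of V$SYSMETRIC_HISTORY (same GROUP_ID caveat as above):

select to_char(begin_time,'HH24:MI:SS') btime,
       round(value,2) reads_per_sec
from   v$sysmetric_history
where  metric_name = 'Physical Reads Per Sec'
  and  group_id    = 2   -- 60 second samples
order by begin_time;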

 

Then for the last week by default, Oracle saves the values for each hour including the maximum, minimum, average, stddev etc in

 

DBA_HIST_SYSMETRIC_SUMMARY

 

Issues

One issue with using

  • V$SYSMETRIC – last 15 and 60 seconds
  • V$SYSMETRIC_SUMMARY – values for the last hour (last snapshot) such as avg, max, min, etc.
  • V$SYSMETRIC_HISTORY – last hour of 1 minute deltas, last 3 minutes of 15 second deltas
  • DBA_HIST_SYSMETRIC_SUMMARY – hour summaries for last week.

is becoming familiar with the statistics names which are different from v$sysstat. We can look at

V$METRICNAME

for the metric names and group names (statistic definitions).
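
For example, a quick sketch for browsing the metric names and their groups:

select distinct group_name, metric_name, metric_unit
from   v$metricname
where  metric_name like '%Read%'
order  by group_name, metric_name;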

For trending data over multiple days, the view DBA_HIST_SYSMETRIC_SUMMARY can be used. The view has a history of all the System Metrics Long Duration statistics. If you store multiple databases in the same AWR repository you can check the statistics available to a particular DBID with the view DBA_HIST_METRIC_NAME.

 

 

Easy query

 

The view DBA_HIST_SYSMETRIC_SUMMARY  can be queried easily for trending metrics, such as the simple query for bytes read by the database per second:

select   dbid, to_char( begin_time ,'YYYY/MM/DD HH24:MI'),
         round(average)
from     dba_hist_sysmetric_summary
where    metric_name= 'Physical Read Total Bytes Per Sec' /* and DBID=[dbid if share repository] */
order by begin_time;

Tougher Query

 

Compare the above query  to the same query on DBA_HIST_SYSSTAT (note there are a lot of stats in v$sysstat)

with stats as (
           select sn.dbid,
                  st.stat_name,
                  to_char(cast(begin_interval_time as date ), 'YYYY/MM/DD HH24:MI') btime,
                  -- Lag (st.value) OVER( PARTITION BY st.stat_name,st.dbid ORDER BY sn.snap_id)
                  Lag (st.value) OVER( PARTITION BY st.stat_name,st.dbid,st.startup_time ORDER BY sn.snap_id)
                               value_beg,
                  st.value     value_end,
                  (cast(end_interval_time as date) - cast(begin_interval_time as date )) * (24*60*60) delta
           from
                  DBA_HIST_SYSSTAT  st,
                  DBA_HIST_SNAPSHOT sn
           where
                    sn.snap_id=st.snap_id and
                    sn.dbid=st.dbid and
                    (st.stat_name= 'physical read total bytes')
           order by begin_interval_time
     )
   select
          dbid, btime,
          round((value_end-value_beg)/delta) rate_per_sec
   from stats
   where (value_end-value_beg) > 0
;

It’s a bit disconcerting to note that the above two queries don’t return the exact same data on my laptop. If it was roughly the same that would be fine, and in general the stats are similar, but there are cases where they differ dramatically. I don’t see anything obvious in the way the queries are written. It possibly has to do with database bounces or the way the database is affected by the laptop’s sleep and hibernate modes. I will have to look into this further.

One trick to make the data easy to load into Excel is to use the html output format and spool to a file with an “.html” extension

SET markup HTML on
spool output.html
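
A minimal end-to-end sketch in SQL*Plus (the file name is arbitrary; remember to turn spooling and the markup back off afterwards):

SET markup HTML on
spool output.html

select   dbid, to_char(begin_time,'YYYY/MM/DD HH24:MI') btime,
         round(average) average
from     dba_hist_sysmetric_summary
where    metric_name = 'Physical Read Total Bytes Per Sec'
order by begin_time;

spool off
SET markup HTML off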

 

Other info

 

List of DBA_HIST views

https://sites.google.com/site/oraclemonitor/awr-views

 


Oracle, performance , , ,

Wait Metrics vs Wait Events

September 3rd, 2013



Here is a quick table comparison of  different types of metrics views

metric_tables

The first line of the table is the classic wait event and statistic views. The following lines are the metric views.  The metric views were introduced in Oracle 10g.

Why Metrics are good

Metric views compute deltas and rates, which hugely simplifies the ability to answer simple questions like “what is the I/O rate on my databases right now?” This question, before 10g, was surprisingly tedious to answer. To answer the question one would have to query v$sysstat, for example:

Select value from v$sysstat where name='physical reads';

but querying v$sysstat just once fails to answer the question but instead answers the question “How much I/O has been done since the database was started?” To answer the original question one would have to query v$sysstat twice and take the delta between the two values:

  • Take value at time A
  • Take value at time B
  • Delta = (B-A)
  • and/or get Rate = (B-A)/elapsed time

Getting these deltas and rates could be a pesky task, especially when working with a customer over the phone. Then in 10g Oracle introduced metric tables which answer the question in one single query.

Using Metrics with Waits


The metric views  apply to wait events as well as  statistics. In a future posting we will go over statistics. In this posting we will go over wait events.  The number of views available to analyze wait events can be confusing. The point of this post is to clarify what the different views available are and how they can be used.

The wait event views are   (at system level)

  • V$SYSTEM_EVENT – wait events cumulative since startup
  • V$EVENTMETRIC – wait event deltas last 60 seconds
  • DBA_HIST_SYSTEM_EVENT – wait events by snapshot (hour) for last week, cumulative since startup

The wait events are rolled up in to groups called wait classes. For wait class we have the following views:

  • V$SYSTEM_WAIT_CLASS – cumulative since start up
  • V$WAITCLASSMETRIC – last 60 seconds deltas
  • V$WAITCLASSMETRIC_HISTORY – 60 seconds deltas for last hour

Note: DBA_HIST_WAITCLASSMETRIC_HISTORY is used for alerts and/or baselines, not everyday values.
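
As a sketch, V$WAITCLASSMETRIC_HISTORY can be used with the same join shown further down to trend the last hour of one minute User I/O deltas (TIME_WAITED in centiseconds, as discussed below):

select to_char(m.begin_time,'HH24:MI') btime,
       round(10*m.time_waited/nullif(m.wait_count,0),3) avg_io_ms   -- centiseconds to milliseconds
from   v$waitclassmetric_history m,
       v$system_wait_class       n
where  m.wait_class_id = n.wait_class_id
  and  n.wait_class    = 'User I/O'
order by m.begin_time;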

Use Wait Event Metrics for Latency

I use wait event metrics for I/O latencies.

It may be surprising that I don’t mention using waits to identify bottlenecks and load on the system. For bottlenecks and load on the system the data in V$ACTIVE_SESSION_HISTORY (ASH) is probably better for a few reasons. First, the data in ASH is multi-dimensional, so it can be grouped by SQL and session. Also, CPU information is derivable from ASH. CPU information is not in the event/waitclass views but is in ASH along with the waits.

Second, the latencies, specifically I/O latencies, are only available in the wait event and wait class views (and the filestat views on a per-file basis).

User I/O latency with WAIT CLASS

One use of wait metrics is determining the average read I/O latency across all the various kinds of read I/O and read sizes:

select 10*time_waited/nullif(wait_count,0) avg_io_ms -- convert centi-seconds to milliseconds
from   v$waitclassmetric  m
       where wait_class_id= 1740759767 --  User I/O
/
 AVG_IO_MS
----------
     8.916

One issue with V$WAITCLASSMETRIC is that the WAIT_CLASS name field is not in the view, so we either have to use the WAIT_CLASS_ID (the hash of the name) as above or join to V$SYSTEM_WAIT_CLASS as below.

select
        10*m.time_waited/nullif(m.wait_count,0) avgms -- convert centisecs to ms
from   v$waitclassmetric  m,
       v$system_wait_class n
where m.wait_class_id=n.wait_class_id
  and n.wait_class='User I/O'
/
 AVG_IO_MS
----------
     8.916

Another issue is that the documentation for 11gR2 says that the TIME_WAITED is microseconds but in my tests it’s actually centisecs

 desc V$SYSTEM_WAIT_CLASS
Name                                        Type
-----------------------------------------  ----------------------------
WAIT_CLASS_ID                                NUMBER
WAIT_CLASS#                                  NUMBER
WAIT_CLASS                                   VARCHAR2(64)
TOTAL_WAITS                                  NUMBER
TIME_WAITED                                  NUMBER  - centi-seconds

You can get a list of all the WAIT_CLASS names in the view V$SYSTEM_WAIT_CLASS.  

 select wait_class_id , wait_class from V$SYSTEM_WAIT_CLASS ;

WAIT_CLASS_ID WAIT_CLASS
------------- ----------------------------------------------------------------
   1893977003 Other
   4217450380 Application
   3290255840 Configuration
   4166625743 Administrative
   3875070507 Concurrency
   3386400367 Commit
   2723168908 Idle
   2000153315 Network
   1740759767 User I/O
   4108307767 System I/O

Latencies for specific I/O  Wait Events

For specific I/O latencies there are two choices – v$eventmetric and v$system_event. With v$system_event you have to run multiple queries and take the deltas yourself, whereas the deltas are already calculated in v$eventmetric. Here is an example of getting I/O latencies from v$eventmetric:

Latencies in the past minute

col name for a25
select m.intsize_csec,
       n.name ,
       round(m.time_waited,3) time_waited,
       m.wait_count,
       round(10*m.time_waited/nullif(m.wait_count,0),3) avgms
from v$eventmetric m,
     v$event_name n
where m.event_id=n.event_id
  and n.name in (
                  'db file sequential read',
                  'db file scattered read',
                  'direct path read',
                  'direct path read temp',
                  'direct path write',
                  'direct path write temp',
                  'log file sync',
                  'log file parallel write'
)
/
INTSIZE_CSEC NAME                      TIME_WAITED WAIT_COUNT      AVGMS
------------ ------------------------- ----------- ---------- ----------
        6017 log file parallel write         2.538          4      6.345
        6017 log file sync                   2.329          1     23.287
        6017 db file sequential read             0          0
        6017 db file scattered read              0          0
        6017 direct path read                    0          0
        6017 direct path read temp               0          0
        6017 direct path write                   0          0
        6017 direct path write temp              0          0
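
For comparison, the cumulative numbers behind these deltas are in v$system_event; a single query only gives the average since instance startup, so for a recent window you would sample it twice and take the difference yourself:

select event,
       total_waits,
       time_waited_micro,
       round(time_waited_micro/1000/nullif(total_waits,0),3) avg_ms_since_startup
from   v$system_event
where  event = 'db file sequential read';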

Latencies averaged over each hour for specific I/O Wait Events

select
       btime,
       round((time_ms_end-time_ms_beg)/nullif(count_end-count_beg,0),3) avg_ms
from (
select
       to_char(s.BEGIN_INTERVAL_TIME,'DD-MON-YY HH24:MI')  btime,
       total_waits count_end,
       time_waited_micro/1000 time_ms_end,
       Lag (e.time_waited_micro/1000)
              OVER( PARTITION BY e.event_name ORDER BY s.snap_id) time_ms_beg,
       Lag (e.total_waits)
              OVER( PARTITION BY e.event_name ORDER BY s.snap_id) count_beg
from
       DBA_HIST_SYSTEM_EVENT e,
       DBA_HIST_SNAPSHOT s
where
         s.snap_id=e.snap_id
   and e.event_name like '%&1%'
order by begin_interval_time
)
order by btime
/
BTIME               AVG_MS
--------------- ----------
20-JUL-11 06:00      5.854
20-JUL-11 07:00      4.116
20-JUL-11 08:00     21.158
20-JUL-11 09:02      5.591
20-JUL-11 10:00      4.116
20-JUL-11 11:00      6.248
20-JUL-11 12:00     23.634
20-JUL-11 13:00     22.529
20-JUL-11 14:00      21.62
20-JUL-11 15:00     18.038
20-JUL-11 16:00     23.127

What are the sizes of the I/O requests?

One issue with looking at I/O latencies is determining the I/O sizes. It would be awesome if there was a view with I/O counts, sizes and latencies in one place. ASH does have this information, but ASH data is weighted to the longer latencies and sizes and not the average. The average sizes have to be gotten from system statistics. The I/O sizes for 'db file sequential read' are single block reads, so that is a single value that can be determined, but the other read events can vary in size. To get a general idea of I/O sizes one could just average across all I/O using the system statistics:

Average I/O Size (across all I/O waits)

select
          sum(decode(metric_name,'Physical Reads Per Sec',value,0))*max(intsize_csec)/100  blocks_read,
          nullif(sum(decode(metric_name,'Physical Read IO Requests Per Sec',value,0)),0)*max(intsize_csec)/100  reads,
            sum(decode(metric_name,'Physical Reads Per Sec',value,0))/
          nullif(sum(decode(metric_name,'Physical Read IO Requests Per Sec',value,0)),0) avg_blocks_read
from v$sysmetric
where group_id = 2  -- 60 second deltas only (not the 15 second deltas)
;

BLOCKS_READ      READS AVG_BLOCKS_READ
----------- ---------- ---------------
       4798       4798               1

 

Load and Bottlenecks

The good thing about wait classes is that they simplify dealing with 1000s of wait events and group them into just a few wait classes. We can get a quick view of load on the system with

select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
from   v$waitclassmetric  m,
       v$system_wait_class n
where m.wait_class_id=n.wait_class_id
and n.wait_class != 'Idle'
;
WAIT_CLASS             AAS
--------------- ----------
Other                    0
Application              0
Configuration            0
Administrative           0
Concurrency              0
Commit                   0
Network                  0
User I/O              .149
System I/O            .002

but the big drawback with wait event and/or wait class views is that they lack information on CPU load. CPU load can be found in the system statistics but it’s just easier to do it all in one query using v$active_session_history. Here is a query using ASH to calculate AAS load on the database over the last 60 seconds:

select
            round(count(*)/secs.var,3)     AAS,
            decode(session_state,'ON CPU','CPU',wait_class)  wait_class
       from v$active_session_history ash,
            (select 60 var from dual)  secs
       where
            SAMPLE_TIME > sysdate - (secs.var/(24*60*60)) and
            SESSION_TYPE = 'FOREGROUND'
       group by decode(session_state,'ON CPU','CPU',wait_class) , secs.var
/
      AAS WAIT_CLASS
---------- ---------------
      .016 Concurrency
      .001 Network
         0 Other
      .083 Configuration
      .001 Administrative
      .034 CPU
         0 System I/O
      .001 Commit
      .054 Application
         0 User I/O

Though with v$sysmetric it’s pretty easy to combine CPU with v$waitclassmetric:

  select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
   from   v$waitclassmetric  m,
          v$system_wait_class n
   where m.wait_class_id=n.wait_class_id
   and n.wait_class != 'Idle'
  union
   select  'CPU', round(value/100,3) AAS from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 ;
WAIT_CLASS                                                              AAS
---------------------------------------------------------------- ----------
Administrative                                                            0
Application                                                            .009
CPU                                                                   1.696
Commit                                                                    0
Concurrency                                                            .001
Configuration                                                             0
Network                                                                .002
Other                                                                     0
System I/O                                                                0
User I/O                                                                  0

and adding v$sysmetric into the query allows me to do something I’ve always wanted which is to include the OS CPU in AAS

 select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS
   from   v$waitclassmetric  m,
          v$system_wait_class n
   where m.wait_class_id=n.wait_class_id
   and n.wait_class != 'Idle'
  union
   select  'CPU', round(value/100,3) AAS from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2
  union
   select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu
  from
  ( select value busy from v$sysmetric where metric_name='Host CPU Utilization (%)' and group_id=2 ) prcnt,
  ( select value cpu_count from v$parameter where name='cpu_count' )  parameter,
  ( select  'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas
;
WAIT_CLASS                                                              AAS
---------------------------------------------------------------- ----------
Administrative                                                            0
Application                                                               0
CPU                                                                    .009
CPU_OS                                                                 .024
Commit                                                                    0
Concurrency                                                               0
Configuration                                                             0
Network                                                                .002
Other                                                                     0
System I/O                                                                0
User I/O                                                                  0

One huge loss compared to using ASH is the information on users waiting for CPU but not running on CPU.

For further reading see

  • oramon.sh – shell script to display I/O latency from v$system_event
  • Oracle CPU time – how to see Oracle’s usage and demand of CPU

 


 

Oracle, performance, wait events , ,