Archive

Archive for August, 2013

SQL joins visualized in a surprising way

August 22nd, 2013



Saw a good posting on SQL joins today that echoes a classic image of SQL joins:

[Figure: the classic Venn-diagram chart of SQL join types]

I loved this graphic when I first saw it. Seeing the graphic made me think “wow, I can actually wrap my mind around these crazy SQL joins.”

But there is more to SQL joins than meets the eye, at least in these pictures. These pictures leave out the effects of projection and amplification. For example, take the simplest case of a two-table join (an inner join):

[Figure: inner join drawn as the intersection of two circles]

The intersection in the above graphic makes sense and looks simple, but getting a simple intersection requires that the two sets, the two tables, be related by a one-to-one relationship. Let’s take a simple query to illustrate the point:

[Figure: the example join query, with the predicate filter shown in blue]

In the query we join tables A and B. In blue I’ve added a “predicate filter” which limits the rows that are joined.
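Since the query itself only appears in the screenshot, here is a rough sketch of what such a two-table join with a predicate filter might look like (table and column names are illustrative, not necessarily those in the image):

    select A.field, B.field
      from A, B
     where A.field = B.field      -- the join condition
       and A.type = 'X'           -- predicate filter on A
       and B.type = 'Y'           -- predicate filter on B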


[Figure: join over a one-to-one relationship]

In a one-to-one relationship, for every value in A.field there will be one and only one matching value in B.field. This is what the join diagram is showing, but how often do data models have one-to-one relationships? Sure, it happens once in a while, but the main relationship is one-to-many, which actually causes projections and not intersections.
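To be concrete about the one-to-one case, a minimal sketch with made-up data:

    -- A.field holds 1, 2, 3 and B.field holds 2, 3, 4 (each value appears once per table)
    select A.field
      from A, B
     where A.field = B.field;
    -- returns exactly the intersection: one row for 2 and one row for 3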


[Figures: join over a one-to-many relationship, a projection of one set onto the other]

Most often tables are related by a one-to-many relationship, a parent-child relationship. For example, one customer can have many orders, but each order pertains to one and only one customer. With one-to-many we no longer get the neat intersection of two circles but a “projection” of one set onto the other. The number of rows returned is limited by the maximum number of rows that pass the predicate filters on tables A and B.
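A hedged sketch of the customer/orders case (hypothetical tables): each parent row matches many child rows, so the result has one row per matching order rather than a neat intersection:

    select c.customer_id, o.order_id
      from customers c, orders o
     where c.customer_id = o.customer_id   -- one-to-many join
       and c.country = 'UK';               -- predicate filter on the parent
    -- the filtered customers are "projected" onto their orders:
    -- the row count is the number of orders belonging to those customers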

The most surprising case is amplification, a multiplication of the rows returned, due to many-to-many relationships. This illustrates part of the reason why many-to-many relationships are problematic.


[Figures: join over a many-to-many relationship; 4 rows in A and 2 rows in B share a single join value, producing 8 rows]

With many-to-many relationships, the maximum number of rows returned is the number of rows returned from A after the predicate filter, multiplied by the number of rows returned from B after the predicate filter, divided by the smaller of the number of distinct values (NDV) of the join column returned from A or B after the predicate filter is applied.

In the above example that is (4*2)/ min(1,1) = 8
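To see those 8 rows concretely, here is a minimal sketch with made-up data matching the example (4 rows in A and 2 rows in B, all sharing the join value 1):

    -- A.field: 1, 1, 1, 1   (4 rows, NDV = 1)
    -- B.field: 1, 1         (2 rows, NDV = 1)
    select count(*)
      from A, B
     where A.field = B.field;
    -- every row in A matches both rows in B: (4 * 2) / min(1, 1) = 8 rows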

With more rows returned by either or both of table A and B, the effect can flood a query with rows to process.

All of this illustrates to me that SQL can be complex, more complex than the useful graphics at the top of the page would suggest.

Uncategorized

Facebook Schema and Performance

August 21st, 2013


From the article “Facebook shares some secrets on making MySQL scale”:

“800 million users and handling more than 60 million queries per second” …”4 million row changes per second.”

and that was almost two years ago. Think what it’s like now!

Ever wonder why Facebook limits your friends to 5000?  Does Facebook want to stop people from using it to promote themselves?

Ever see this message “There are no more posts to show right now” on Facebook?

Notice it says “There are no more posts to show right now.”

I got this message when I scrolled back through my friends’ status updates. After scrolling back a few days I hit this message. The message is strange, since there were thousands more status updates that Facebook could have shown me. Why did I run into this message?

Is Facebook just being capricious or dictatorial about how it is used? I don’t know, but I think the more likely answer is much more mundane and possibly quite interesting: the reason may be simple technical limitations.

How would/should/could status updates be stored on Facebook?

The first thing that comes to mind is something like these tables in a relational database. In this design there are three tables (a rough DDL sketch follows the list):

  • Facebook accounts
    • ID
    • email
    • other fields
  • Friends
    • user id
    • friend’s id (contact id or c_id)
  • Status Updates
    • ID of the account making the update
    • status update
    • date of status update
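A rough sketch of that schema, assuming generic SQL; the names are illustrative (the friends table is called contacts in the queries below, and the date column is written here as update_date although the queries refer to it simply as date):

    create table accounts (
       id     integer primary key,
       email  varchar(100)
       -- ... other fields
    );

    create table contacts (        -- the "Friends" table
       id     integer,             -- the account
       c_id   integer              -- one of that account's friends (contact id)
    );

    create table updates (
       id           integer,       -- the account making the update
       status       varchar(4000),
       update_date  date           -- "date" in the queries below
    );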

So if sue@y.com logs onto Facebook, then Facebook needs to go and get the status updates of her friends/contacts. The first step is to get the list of friends and the second step is to get the updates from those friends. In SQL this might look like:

    select id, status
      from updates
     where id in (select c_id from contacts where id = 2)
     order by date

As the number of friends and status updates increases, this query is going to take longer and longer. Maybe this is the reason why Facebook limits the number of friends and the history. How can the response time for retrieving friends’ updates be kept constant?

First, the home page only has to show, at least initially, something like 20 updates. The above query can be wrapped with a top-20 filter, something like:

   select * from (
      select id, status
        from updates
       where id in (select c_id from contacts where id = 2)
       order by date desc )      -- most recent first
   where rownum <= 20;

But really, that’s not going to do much good because the query still has to create the whole result set before sorting it by date and then limiting the output to 20 rows. You could add a date limiter on the updates:

   select * from (
      select id, status
        from updates
       where id in (select c_id from contacts where id = 2)
         and date >= current_date - 2     -- only the last couple of days
       order by date desc )
   where rownum <= 20;

It seems Facebook has a limit on the number of days returned and on the number of friends, but there isn’t, AFAIK, a limit on the number of updates that friends can make, so as they make more updates, the query takes longer and longer.

What other kind of design could be used? To speed up the query, the data could be denormalized a lot or a little. For a small change in the data, the date of each friend’s most recent update could be added to the list of friends, meaning we can limit the search by the date field in friends instead of looking at all the updates themselves. The query then becomes something like:

   select status
     from updates
    where id in ( select c_id
                    from (select c_id from contacts where id = 2 order by date desc)
                   where rownum <= 20 )
    order by date desc

Instead of having to select status updates from all the friends, the query just selects the updates of the 20 (or fewer) friends who have had the most recent updates.

Or one could go a step further: when you post a status update, a row gets inserted for each of your friends, so that every friend has your update associated with them, and then all that has to be done is select the top 20 updates from that list. No joining. And if indexed, then the rows returned can be precisely limited to those 20 rows. On the other hand, this creates an enormous amount of insert activity and data redundancy.

Maybe have two tables: (1) status updates, each with a unique id, and (2) a table of all friends’ updates. The second table would have, for each user, a row containing the status update id of each of their friends’ updates and a timestamp. So if I wanted status updates for my friends, I would just get the last 20 status update ids from this table for me and then fetch the actual content for those 20 status updates. This still keeps a lot of unnecessary information. On the other hand, the data doesn’t need to be kept for that long, maybe the last couple of days, and beyond that the system could fall back to something like the join query above.
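As a hedged sketch of that two-table “push” design (table and column names are made up), posting fans a row out into each friend’s feed, and reading the home page becomes a simple top-20 lookup instead of a big join:

   -- statuses: one row per status update
   create table statuses (
      status_id    integer primary key,
      author_id    integer,
      status       varchar(4000),
      update_date  date
   );

   -- feeds: one row per (reader, status) pair, inserted for every friend
   -- of the author at the time the status is posted
   create table feeds (
      user_id      integer,     -- the friend who will read the update
      status_id    integer,
      update_date  date
   );

   -- reading my home page: the 20 most recent updates in my feed
   select status from (
      select s.status
        from feeds f, statuses s
       where f.user_id = 2
         and f.status_id = s.status_id
       order by f.update_date desc )
   where rownum <= 20;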

What other kinds of optimizations could they do? What would the pros of other methods be? What are the cons?

This has already been solved a number of times at a number of places.  I haven’t been involved in any nor am I involved in any of these architectural questions right now, but it’s interesting to think about.

Why does Facebook want to know who your close friends are? Is it because they care, or because it helps prioritize which status updates to denormalize? Why do they limit friends to 5000? Is it because they really care, or is it a scaling issue?

 

Related Reading:

Twitter

id generation

http://engineering.twitter.com/2010/06/announcing-snowflake.html

http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html

Facebook schema

http://www.flickr.com/photos/ikhnaton2/533233247/

Facebook lamp stack

http://itc.conversationsnetwork.org/shows/detail4746.html

how does Facebook do it

http://ask.metafilter.com/82769/How-is-Facebook-doing-its-queries

ebay

http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

high scalability

http://highscalability.com/

http://memcached.org/

scaling

http://danga.com/words/2007_06_usenix/usenix.pdf

Flickr

http://radar.oreilly.com/archives/2006/04/database-war-stories-3-flickr.html

Myspace

http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/Inside-MySpacecom/

dealing with stale data

http://www.mnot.net/blog/2007/12/12/stale

Facebook schema

http://upload.wikimedia.org/wikipedia/commons/9/95/Metamodel_of_Facebook.jpg

Uncategorized

Delphix finalist for UKOUG partners of the year

August 20th, 2013


Delphix is a finalist for UKOUG Partner of the Year!

See the finalists at UKOUG finalists for Partner of the Year.

Vote here: UKOUG Partner of the Year awards 2013

Also on the list are fellow OakTable-related organizations Scale Abilities and Enkitec.


Delphix


 

What are your key reasons for choosing this category?

Delphix is an Oracle partner that has developed ‘Database Virtualisation’ technology for the Oracle community. The Delphix solution enables businesses to operate differently by empowering development, test, and QA teams to generate full-size databases on demand in minutes with almost no storage overhead, reducing data center operational and storage costs and increasing corporate agility.

How does your company differentiate itself against other Oracle partners focused in this area?

The Delphix solution is shipped as a software appliance and is hardware agnostic, which means the customer does not suffer ‘hardware lock-in’ like other, less functionally rich solutions.

Please give at least one key example of your company’s achievements in this area.


Corporate Express Presses ‘Easy Button’ for Database Provisioning, Refresh, and Data Recovery with Delphix:

  • 12:1 storage reduction
  • Self-service database provisioning and refresh
  • Faster and more agile application development
  • Consolidated and more predictable disaster recovery

“We moved from complexity and cost to a few clicks that developers can manage themselves using standard web browsers. Previously our environments were averaging two refreshes per year. Now developers can provision or refresh VDBs at any time. The ease of use and self service has also freed our senior IT personnel to focus on higher value projects for the business.”

– Andrew Grech, Infrastructure & Operations Manager at Corporate Express.

Please describe how your company has made a difference to a customer’s success in the last 12 months.

Before Delphix, KLA-Tencor was managing to do 4 SAP customization projects a year; after Delphix they were able to complete 22 SAP customization projects, by increasing the number of copies that could be made concurrently and decreasing the time needed to make a copy and to do refreshes.

The Delphix solution has been deployed at a global bank as part of an Operational Excellence programme. The Delphix technology has improved business agility and productivity by reducing database refresh times from over a week to minutes. The bank has also seen major storage reductions, with multi-terabyte databases created on demand consuming tens of GBs of storage.

Supporting document:

Informatica Case Study

Uncategorized

Latency heatmaps in D3 and Highcharts

August 20th, 2013


See Brendan Gregg’s blog on how important and cool heatmaps can be for showing latency information and how average latency hides what is really going on.

Now if we want to create heatmap graphics, how can we do it? Two popular web methods for displaying graphics are Highcharts and D3. Two colleagues of mine whipped up some quick examples in both Highcharts and D3 to show latency heatmaps, and those two examples are shown below. The data in the charts is random, just for the purpose of showing these graphics in action.

Highcharts Heatmap


see code at http://jsfiddle.net/vladiweb/GNd3G/


 

D3 Heatmap


see code at
http://jsfiddle.net/eyalkaspi/YH8sw

graphics

Looking for past posts?

August 19th, 2013

My previous blog had outgrown the limited area of just Oracle. The current blog will cover the broader areas of data mobility, agility, and visualization.

If there are past posts that you are looking for, please drop me a line and I will prioritize them for posting on this blog.

 

 

Uncategorized