Facebook Schema and Performance

August 21st, 2013


From the article “Facebook shares some secrets on making MySQL scale”:

“800 million users and handling more than 60 million queries per second” … “4 million row changes per second.”

and that was almost two years ago. Think what it’s like now!

Ever wonder why Facebook limits your friends to 5000?  Does Facebook want to stop people from using it to promote themselves?

Ever see this message “There are no more posts to show right now” on Facebook?

Notice the qualifier at the end: “right now.”

I got this message when scrolling back through my friends’ status updates. After scrolling back a few days I hit it. The message is strange, since there were thousands more status updates that Facebook could have shown me. Why did I run into it?

Is Facebook just being capricious or dictatorial about how the site is used? I don’t know, but I think the more likely answer is much more mundane and possibly quite interesting: the reason may be simple technical limitations.

How would/should/could status updates be stored on Facebook?

The first thing that comes to mind is something like the following three tables in a relational database (a sketch of the DDL follows the list):

  • Facebook accounts
    • ID
    • email
    • other fields
  • Friends
    • user id
    • friend’s id (contact id or c_id)
  • Status Updates
    • ID of the account making the update
    • status update
    • date of status update
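
A minimal, MySQL-flavored sketch of those three tables. The names (accounts, contacts, updates, c_id) are assumptions chosen to match the queries below, not Facebook’s actual schema; note that date as a column name would need quoting or renaming in some SQL dialects:

    create table accounts (
      id     bigint primary key,   -- the Facebook account id
      email  varchar(255)          -- plus other fields omitted here
    );

    create table contacts (
      id    bigint,                -- the user
      c_id  bigint,                -- the friend (contact id)
      primary key (id, c_id),
      foreign key (id)   references accounts (id),
      foreign key (c_id) references accounts (id)
    );

    create table updates (
      id      bigint,              -- account making the update
      status  varchar(4000),       -- the status update text
      date    datetime,            -- date of the status update
      foreign key (id) references accounts (id)
    );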

So if sue@y.com logs onto Facebook, then Facebook needs to go and get the status updates of her friends/contacts. The first step is to get her list of friends and the second step is to get the updates from those friends. In SQL this might look like:

    select id, status
    from updates
    where id in (select c_id from contacts where id = 2)
    order by date desc

As the number of friends and status updates increases, this query is going to take longer and longer. Maybe this is the reason why Facebook limits the number of friends and the history. How can the response time for the retrieval of friends’ updates be kept constant?

First, the home page only has to show, at least initially, something like 20 updates. The above query can be wrapped with a top-20 filter, something like:

   select * from (
      select id, status
      from updates
      where id in (select c_id from contacts where id = 2)
      order by date desc)
   where rownum <= 20;

But really, that’s not going to do much good, because the query still has to create the whole result set and sort it by date before limiting the output to 20 rows. You could add a date limiter on the updates:

   select * from (
      select id, status
      from updates
      where id in (select c_id from contacts where id = 2)
        and date >= current_date - 2   -- only look at the last two days
      order by date desc)
   where rownum <= 20;

It seems Facebook has a limit on the number of days returned and on the number of friends, but there isn’t, AFAIK, a limit on the number of updates that friends can make, so as they post more updates, the query takes longer and longer.

What other kind of design could be used? To speed up the query, the data could be denormalized a lot or a little. For a small change in the data, a date column could be added to the list of friends, recording when each friend last posted; that way we can limit by the date field in friends instead of scanning all the updates themselves (a sketch of how the column might be maintained follows the query). Now the query becomes something like:

   select status
   from updates
   where id in (select c_id
                from (select c_id from contacts where id = 2 order by date desc)
                where rownum <= 20)
   order by date desc

Instead of having to select status updates from all the friends, the query first picks the 20 (or fewer) friends who have posted most recently and only selects their updates.
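
A minimal sketch of how that denormalized date column might be kept fresh, assuming the application (or a trigger) stamps the friendship rows on every post. The statements are illustrative, not Facebook’s actual mechanism, and keep the column name date to match the queries above:

   -- add the "friend's most recent post" date to each friendship row
   alter table contacts add column date datetime;

   -- when account 7 posts a new update, stamp every row that lists
   -- account 7 as a contact, so readers can sort friends by recency
   update contacts
   set date = current_timestamp
   where c_id = 7;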

Or one could go a step farther, such that when you post a status update, a row gets inserted for each of your friends, so that every friend has your update associated with them. Then all that has to be done is select the top 20 updates from that list. No joining. And if indexed, the rows returned can be precisely limited to those 20 rows. On the other hand, this creates an enormous amount of insert activity and data redundancy.

Maybe have two tables: (1) status updates, each with a unique id, and (2) a table of all friends’ updates. The second table would have, for each user, rows containing the status update ids of all their friends and a timestamp. So if I wanted status updates for my friends, I would just get the last 20 status update ids from this table for me and then fetch the actual content of those 20 status updates (a sketch of this design follows). Still, this keeps a lot of unnecessary information. On the other hand, I don’t need to keep the data for that long; maybe the last couple of days, and beyond that the system could fall back to one of the join queries above.
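
A minimal, MySQL-flavored sketch of that two-table, fan-out-on-write design. The status_updates and feed tables, and the literal ids, are illustrative assumptions:

   -- (1) each status update stored once, with a unique id
   create table status_updates (
     update_id  bigint primary key,
     author_id  bigint,            -- account making the update
     status     varchar(4000),
     date       datetime
   );

   -- (2) fan-out table: one row per (recipient, update) pair;
   -- the primary key doubles as an index for "latest 20" scans
   create table feed (
     user_id    bigint,            -- the friend who should see the update
     update_id  bigint,
     date       datetime,
     primary key (user_id, date, update_id)
   );

   -- on post: one row in status_updates plus one feed row per friend
   -- (here account 7 posts update 12345)
   insert into feed (user_id, update_id, date)
   select c_id, 12345, current_timestamp
   from contacts
   where id = 7;

   -- reading user 2's home page: no join against contacts at all
   select s.status
   from (select update_id
         from feed
         where user_id = 2
         order by date desc
         limit 20) f
   join status_updates s on s.update_id = f.update_id;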

What other kinds of optimizations could they do? What would the pros of other methods be? What are the cons?

This has already been solved a number of times at a number of places. I haven’t been involved in any of those efforts, nor am I involved in any of these architectural questions right now, but it’s interesting to think about.

Why does Facebook want to know who your close friends are? Is it because they care, or because it helps prioritize which status updates to denormalize? Why do they limit friends to 5000? Is it because they really care, or is it a scaling issue?


Related Reading:

  • Twitter id generation: http://engineering.twitter.com/2010/06/announcing-snowflake.html
  • Twitter on MySQL: http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html
  • Facebook schema: http://www.flickr.com/photos/ikhnaton2/533233247/ and http://upload.wikimedia.org/wikipedia/commons/9/95/Metamodel_of_Facebook.jpg
  • Facebook LAMP stack: http://itc.conversationsnetwork.org/shows/detail4746.html
  • How does Facebook do it: http://ask.metafilter.com/82769/How-is-Facebook-doing-its-queries
  • eBay: http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf
  • High scalability: http://highscalability.com/ and http://memcached.org/
  • Scaling: http://danga.com/words/2007_06_usenix/usenix.pdf
  • Flickr: http://radar.oreilly.com/archives/2006/04/database-war-stories-3-flickr.html
  • MySpace: http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/Inside-MySpacecom/
  • Dealing with stale data: http://www.mnot.net/blog/2007/12/12/stale


Comments:

  gregoire | #1

    Really interesting. I’m not able to help on that, but I would surely be curious to get the answer. I’ve often wondered whether, and what kind of, RDBMS Amazon, Facebook, and others are using. Do you think Oracle is able to handle this, or does it have to be some other big system or internal development?

  khailey | #2

    Comments from Fabian Pascal via this post on LinkedIn:
    Kyle, the schema is logical; it does not affect performance. Whatever performance you get with any schema is a function of physical implementation factors, of which there are many. To the extent that you modify the schema and performance changes, most of it has to do with the poor support of data independence.

  khailey | #3

    @Fabian: Great points, Fabian. So one awesome answer to this investigation would be to give a schema and then include how that schema would be stored to ensure the required performance.

  khailey | #4

    @gregoire: I don’t see any reason why Oracle can’t handle it. Iggy Fernandez just gave a great presentation at NoCOUG comparing an Amazon NoSQL solution to an eBay Oracle solution, and as I understood it, both systems worked fine, but with the Oracle system you got referential integrity, ACID-ity, and joinable structures.
    The one drawback of Oracle, or at least it was, is that one is generally obligated to accept the overhead of transactional consistency whether one wants it or not, though that overhead is a small price to pay for the wide array of robust solutions possible with Oracle. For example, if I want to manipulate some scratch tables and I don’t care whether they get lost or inconsistent, because I can recreate them and I just want speed, then there is no way to do it. I can use GTTs (global temporary tables), which eliminate some of the overhead, but I can’t pass my GTT contents to someone else without going through a transaction layer.

  khailey | #5

    Why use NoSQL, without ACID-ity, with limited or no joins, and without referential integrity, when you can have the same solutions and the same performance, and gain ACID-ity, relational integrity, and access to all possible join queries imaginable? Iggy Fernandez breaks down a NoSQL solution at Amazon and compares it to an Oracle solution at eBay. Check out Iggy’s presentation at
    http://www.nocoug.org/download/2013-08/NOCOUG_201308_Soul-searching_for_the_Relational_Camp.pdf

  | #6

    @khailey

    Thanks. The Amazon vs. eBay comparison is the key point of my presentation, since performance, scalability, and availability are the NoSQL claims to fame, and eBay proved that you can achieve the same goals without abandoning SQL.

  khailey | #7

    Fabian Pascal: I don’t know how this is a solution. The real one is a DBMS implementation that supports PDI (physical data independence) and offers as many storage and access methods as possible, such that it performs well with a fully normalized schema. But instead of focusing on that, the industry is busy with NoSQL, clouds, and objects.

  khailey | #8

    @Fabian Pascal: Good points, and that strikes to the heart of the matter, but if I’m at a startup that’s going to be the new Facebook, it doesn’t matter to me what the best theoretical database software is; I have to pick from the software that exists in the market. As for storage and access methods, Oracle offers many, so part of the solution space, along with the schema design, would be outlining how to store the data. Iggy Fernandez has a nice presentation on using various storage methods to achieve NoSQL flexibility and speed, but with relational structure and ACID-ity: http://www.nocoug.org/download/2013-08/NOCOUG_201308_Soul-searching_for_the_Relational_Camp.pdf

  | #9

    The NoSQL camp is strong because of its focus on performance, scalability, and availability. These are physical issues that we in the relational camp have told two generations of application developers to leave to the DBMS and the DBA.

    RE: A DBMS implementation that supports [physical data independence] and offers as many storage and access methods as possible, such that it performs well with a fully normalized schema.

    That would be Oracle. Since its very early versions, Oracle has provided the “table cluster” storage method, with which related data from multiple tables can be physically colocated in the same database block, thus eliminating the join penalty. Codd uses the following example in his 1970 paper introducing relational theory.

    employee (employee#, name, birthdate)
    jobhistory (employee#, jobdate, title)
    salaryhistory (employee#, jobdate, salarydate, salary)
    children (employee#, childname, birthyear)

    If the above four tables are clustered by the employee# field, data for an employee from all four tables can be colocated on the same database block.

    In other words, Oracle eliminated the join penalty decades ago.
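
    For illustration, a minimal sketch of that table-cluster DDL in Oracle syntax, using two of Codd’s example tables (the cluster name and SIZE value are assumptions, not from the comment):

       -- a cluster keyed on employee#: rows from all clustered tables
       -- sharing an employee# value land in the same database block(s)
       create cluster emp_cluster (employee# number) size 8192;

       -- an indexed cluster needs its cluster index before inserts
       create index emp_cluster_idx on cluster emp_cluster;

       create table employee (
         employee#  number primary key,
         name       varchar2(100),
         birthdate  date
       ) cluster emp_cluster (employee#);

       create table jobhistory (
         employee#  number,
         jobdate    date,
         title      varchar2(100)
       ) cluster emp_cluster (employee#);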

    However, when normalization is discussed, there is always the implication that the normalized tables are the ones that should be physically stored, and a whole generation of application developers has come to believe this. In his 1970 paper, Codd differentiated between the stored set, the named set, and the expressible set. Consider an unnormalized table A such that A = B NATURAL JOIN C; that is, B and C are projections of A obtained by a lossless decomposition of A, a.k.a. normalization. The typical practice is to store B and C and force the DBMS to incur a join penalty by performing joins at run time. This is not efficient. Why not store just A, leaving B and C in the named set and leaving it to the DBMS to prevent inconsistencies from occurring?
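
    In SQL terms, the idea is to store A and leave B and C in the named set as views (a sketch; the columns for A are invented for illustration):

       -- store the unnormalized A (the stored set)
       create table A (
         employee#  number,
         name       varchar2(100),   -- functionally dependent on employee#
         jobdate    date,
         title      varchar2(100)
       );

       -- B and C stay in the named set as projections of A; the DBMS,
       -- not the application, keeps them consistent with A
       create view B as select distinct employee#, name from A;
       create view C as select distinct employee#, jobdate, title from A;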

    Also, why not let the developers decide whether inconsistencies should be detected and prevented in real time? Codd seems to have been a very pragmatic man because, in his 1970 paper, he says:

    “There are, of course, several possible ways in which a system can detect inconsistencies and respond to them. In one approach the system checks for possible inconsistency whenever an insertion, deletion, or key update occurs. Naturally, such checking will slow these operations down. If an inconsistency has been generated, details are logged internally, and if it is not remedied within some reasonable time interval, either the user or someone responsible for the security and integrity of the data is notified. Another approach is to conduct consistency checking as a batch operation once a day or less frequently.”

    In other words, the relational model allows us to define consistency constraints but the decision to enforce them in real-time could be considered a physical design decision that is left to the application developer.

  | #10

    P.S. A colorful quote directed at the join penalty is: “Using tables to store objects is like driving your car home and then disassembling it to put it in the garage. It can be assembled again in the morning, but one eventually asks whether this is the most efficient way to park a car.” The quote is attributed to Esther Dyson but I have been unable to trace it back to its supposed source. It dates back to the days of “object-oriented databases.” The NoSQL camp is revisiting the same argument.

  khailey | #11

    @Iggy: Great analysis, and an awesome quote even if it makes me cringe; I think it captures the current startup mentality toward databases and objects.

  Akmal Chaudhri | #12
  khailey | #13

    Here is an interesting article on trying to retrieve the most recent sorted updates from a friends list:
    http://java.dzone.com/articles/mongodb-indexing-tip-1-find

  khailey | #14
  Ludovico | #15

    And the ycombinator responses to the above link: https://news.ycombinator.com/item?id=6712703

  Th | #16

    Facebook uses a NoSQL db.

