Saturday, July 31, 2010
Close to Optimal Data Modeling
Agile data modeling, denormalization, and other important factors
This subject probably deserves an entire book, but in the interest of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note, this is not meant to be exhaustive. Relational models, object-relational models, and object models are not always the right solution for the problem. There are many great solutions available in the NoSQL area, and there is nothing wrong with a hybrid solution.
Software applications that are compiled for specific machine architectures often store and retrieve information from database software systems. Web server platforms often interpret machine-independent scripts that store and retrieve information from database software systems. Java bytecode is executed by a Java Virtual Machine, and data is stored and retrieved from database software systems.
Efficient storage and retrieval of information from database software systems is certainly dependent on underlying hardware and operating system characteristics. However, efficient storage and retrieval also depend on the structure and efficient execution of the query statements used to store and retrieve the information. In a purely academic world, data is modeled and normalized into 3NF, 4NF, and 5NF forms. However, higher-order normal forms do not always promote the creation of optimal query statements. Therefore, certain parts of a data model are often denormalized to improve the read performance of the database software system.
At a high level, joins between multiple large relations are generally expensive, so denormalization reduces this cost. Reducing the cost of something typically results in inconsistencies or redundancies in the data, so the developers are left with the job of minimizing the amount of redundant data. A formal process typically alleviates this strain.
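To make this trade-off concrete, here is a minimal sketch in SQL, assuming a hypothetical customers/orders schema (the table and column names are illustrative, not from any real system). The normalized design reads customer attributes through a join; the denormalized reporting table copies the customer name into each order row to keep the read path join-free, at the cost of redundant data that must be kept consistent:

-- Normalized: each fact is stored exactly once.
CREATE TABLE customers (
    customer_id   integer PRIMARY KEY,
    customer_name text NOT NULL
);

CREATE TABLE orders (
    order_id    integer PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers (customer_id),
    order_total numeric(10,2) NOT NULL
);

-- Reads pay for a join between the two relations.
SELECT o.order_id, c.customer_name, o.order_total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- Denormalized: customer_name is duplicated into each order row,
-- so reads avoid the join but updates must touch every copy.
CREATE TABLE orders_report (
    order_id      integer PRIMARY KEY,
    customer_id   integer NOT NULL,
    customer_name text NOT NULL,
    order_total   numeric(10,2) NOT NULL
);

SELECT order_id, customer_name, order_total FROM orders_report;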
Defining the I/O characteristics of a database is an important precursor to data modeling. Oftentimes, the I/O characteristics of the database are not defined before the data is modeled. Nevertheless, there are a few basic tasks and questions that can help define them. What is the purpose of the database, and what types of problems does it solve? What types of operations (read, write) will be performed on the database? How many users will the database support? Define user stories: short, succinct one-to-three-sentence action statements. Define as many as you need, from both the developer perspective and the user perspective. The user stories will help define the conceptual data model for the domain, from which the logical data model can be created. A clear definition of the purpose of the system will aid in defining the read/write (I/O) characteristics of the database software system, thus forming a picture of the system. For instance, OLTP databases typically support short transactions, so it is important that write latency is minimized. The answers to the questions above will also help determine the type of hardware storage configuration needed. Another important question when determining the hardware configuration is this: how much downtime can we afford, and how much failover do we need?
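As a minimal illustration of why OLTP write latency matters, consider the kind of short transaction such a system must commit quickly. This sketch reuses the hypothetical orders table from above and adds an equally hypothetical account_balances table:

-- A short OLTP transaction: one insert, one update, then commit.
-- The faster this round-trip completes, the lower the write
-- latency observed by the user.
CREATE TABLE account_balances (
    customer_id integer PRIMARY KEY,
    balance     numeric(10,2) NOT NULL
);

BEGIN;
INSERT INTO orders (order_id, customer_id, order_total)
VALUES (1001, 42, 19.99);
UPDATE account_balances
SET balance = balance - 19.99
WHERE customer_id = 42;
COMMIT;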
At this point, we can begin to see the types of queries that will be run on the database, with heavy consideration given to the I/O characteristics of the database and the types of joins and scans we would ideally like the query optimizer to use.
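In PostgreSQL, for instance, EXPLAIN (or EXPLAIN ANALYZE) shows which join strategies and scan types the optimizer actually chose, so those choices can be checked against expectations. A small sketch against the hypothetical schema above:

EXPLAIN ANALYZE
SELECT o.order_id, c.customer_name, o.order_total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_total > 100;

-- The resulting plan shows, for example, whether the optimizer
-- chose a hash join, merge join, or nested loop, and whether each
-- relation is read with a sequential scan or an index scan.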
Agile data modeling means continual, iterative feedback: the ability to add entities and relationships, or slice off sections of a data model in various sizes, at any point in the project. This may seem difficult, but it can be accomplished. Whether the database architecture adheres to the relational model, the object-relational model, or the object model, the proper classification of entities and attributes, in conjunction with a balance of normalized and denormalized groups of data, will allow the addition and subtraction of small and large chunks of relations from the data model, for a specific domain, throughout the development process.
Formal algorithms can be used to map entity-relationship models to relations.
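As a minimal sketch of such a mapping, assume a hypothetical ER model in which one department employs many employees. The standard construction turns each entity set into a relation and implements the one-to-many relationship as a foreign key on the many side:

-- Entity set Department becomes a relation keyed by its identifier.
CREATE TABLE departments (
    department_id   integer PRIMARY KEY,
    department_name text NOT NULL
);

-- Entity set Employee becomes a relation; the employs relationship
-- is realized as a foreign key on the many side.
CREATE TABLE employees (
    employee_id   integer PRIMARY KEY,
    employee_name text NOT NULL,
    department_id integer NOT NULL REFERENCES departments (department_id)
);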
And how is this done? By continually modeling entities in the real world: abstracting classes, refactoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one-size-fits-all approach.
Saturday, July 24, 2010
High Performance Database Development
Fast, high-performance, database-driven application architectures that support true rapid development are little talked about. AOLServer, OpenACS, PostgreSQL, and TCL are one such combination.
I was introduced to AOLServer and OpenACS in 2002 by a close friend I grew up with. We built a very robust application architecture over the course of the following six years.
PostgreSQL sat behind the AOLServer/OpenACS instance, and AOLServer nicely managed the database connection pools. AOLServer is multi-threaded and handles TCL interpretation and execution internally. TCL makes heavy use of lists (implementation notes aside), so many LISP programmers have enjoyed working with it.
The OpenACS developer toolbar gives developers access to debugging and profiling information about their OpenACS application. It is built into the OpenACS framework and can be enabled or disabled depending on the needs of the developer.
Once enabled, the developer toolbar is accessible via a web browser and provides a range of information about the current page, including SQL queries, response times, and server resources. This information can be used to optimize and debug an OpenACS application, allowing developers to quickly identify and fix performance issues.
The OpenACS developer toolbar is just one of the many tools available to developers using the AOLServer, OpenACS, PostgreSQL, and TCL stack. Together, these technologies provide a fast, high-performance, database-driven application architecture that supports rapid development.
Monday, July 19, 2010
DTrace on FreeBSD 8
DTrace on FreeBSD 8 - a new server with no load
# uname -msri
FreeBSD 8.0-RELEASE-p3 amd64 DEV_A64_KERNEL
# /usr/local/share/DTraceToolkit/syscallbypid.d
Tracing... Hit Ctrl-C to end.
 77913 perl         sigprocmask      48
  1267 svscan       stat             51
 45001 httpd        read             52
 26063 python2.4    recvfrom         61
 26922 sshd         ioctl            69
 27104 more         read             69
 26888 postgres     getppid          78
 26888 postgres     gettimeofday     78
 26888 postgres     select           78
 26889 postgres     getppid          78
 26889 postgres     select           78
 45001 httpd        poll             81
 27102 postgres     read             82
 27103 postgres     read             82
 27105 postgres     read             82
 27101 dtrace       clock_gettime    97
 26063 python2.4    select          102
 45001 httpd        writev          118
 44966 python2.4    read            121
 26890 postgres     read            162
 26063 python2.4    read            179
 26063 python2.4    _umtx_op        186
 27104 more         write           205
 26063 python2.4    write           208
 26891 postgres     write           300
 27101 dtrace       ioctl           301
 26922 sshd         write           304
 26922 sshd         read            306
 27054 psql         poll            378
 27054 psql         recvfrom        378
 27055 postgres     sendto          379
 26922 sshd         select          609
 26922 sshd         sigprocmask    1218
 27054 psql         write          3203
# /usr/local/share/DTraceToolkit/rwbypid.d
 78168 sshd         R    39
 97062 squid        R    58
 45164 httpd        W    81
 97062 squid        W    95
 97474 python2.4    R    98
   555 pflogd       R   132
 45164 httpd        R   186
# dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'
CPU     ID    FUNCTION:NAME
  0  25202    open:entry svscan .
  0  25202    open:entry imapd ./cur
  0  25202    open:entry imapd ./new
  0  25202    open:entry imapd ./courierimapkeywords/:list
  0  25202    open:entry imapd ./courierimapkeywords
  0  25202    open:entry svscan .
  2  25202    open:entry postgres pg_stat_tmp/pgstat.tmp
  7  25202    open:entry postgres global/pg_database
  7  25202    open:entry postgres pg_stat_tmp/pgstat.stat
  7  25202    open:entry postgres pg_stat_tmp/pgstat.stat
  7  25202    open:entry postgres pg_stat_tmp/pgstat.stat
  4  25202    open:entry tinydns data.cdb
  5  25202    open:entry tinydns data.cdb
  0  25202    open:entry svscan .
  0  25202    open:entry svscan .
  0  25202    open:entry svscan .
Syscalls count by process
# dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'
 28820 qmailmrtg7    158
 28825 qmailmrtg7    158
 27504 postgres      159
 28776 cron          234
 28853 bash          245
 28861 bash          245
 28869 bash          245
 28877 bash          245
 28798 qmailmrtg7    264
 28777 newsyslog     293
 28772 dtrace        322
 28778 sh            327
 28281 httpd         542
 28900 svn           691
 28903 svn           740
 28902 svn           748
 28904 svn           748
 28901 svn           758
 28780 perl         3023
Files opened by process
# dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'
CPU     ID    FUNCTION:NAME
  4  25202    open:entry svscan .
  0  25202    open:entry squid /usr/local/plone/<domain-name>-dev.com/var/squidstorage/08/0C
  4  25202    open:entry svscan .
  4  25202    open:entry svscan .
  4  25202    open:entry svscan .
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M11620P234573_courierlock.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M11620P234573_courierlock.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/tmp
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M11867P28573_imapuid_15.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/courierimapuiddb
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M113467P28573_imapuid_15.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/cur
  0  25202    open:entry imapd ./.Spam/new
  0  25202    open:entry imapd ./.Spam/courierimapkeywords/:list
  0  25202    open:entry imapd ./.Spam/courierimapkeywords
  0  25202    open:entry squid /usr/local/plone/abcdexcrfdsf-dev.com/var/squidstorage/09/0C
  4  25202    open:entry svscan .