This subject probably deserves an entire book, but in the essence of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note, this is not meant to be exhaustive. Relational models, object-relational models, and object models are not always the right solution for the problem. There are many great solutions available in the NoSQL area, and there is nothing wrong with a hybrid solution.
Software applications that are compiled on specific machine architectures often store and retrieve information from database software systems. Web server platforms often interpret machine-independent script that stores and retrieves information from database software systems. Java bytecode is executed by a Java Virtual Machine, and data is stored and retrieved from database software systems.
Efficient storage and retrieval of information from database software systems are certainly dependent on underlying hardware and operating system software characteristics. However, efficient storage and retrieval of information is also dependent on the structure and efficient execution of query statements that are used to store and retrieve the information. In a purely academic world, data is modeled and normalized in 3NF, 4NF, and 5NF forms. However, higher-order normalized forms do not always promote the creation of optimal query statements. Therefore, certain parts of a data model are often denormalized to improve the read performance of the database software system.
At a high level, joins between multiple, large relations are generally expensive, so denormalization reduces this cost. Reducing the cost of something typically results in inconsistencies or redundancies in the data, so the developer(s) are left with the job of minimizing the amount of redundant data. A formal process typically alleviates this strain.
Defining the I/O characteristics of a database is an important precursor to data modeling. Often times, the I/O characteristics of the database are not defined before modeling the data. Nevertheless, there are a few basic tasks and questions that can help define the I/O characteristics of the database. What is the purpose of the database and what types of problem(s) does the database solve? What types of operations (read, write) will be performed on the database? How many users will the database support? Define user stories - short, succinct 1-3 sentence action statements. Define as many as needed from both the developer perspective and the user perspective. The user stories will help define the conceptual data model for the domain from which the logical data model can be created. A clear definition of the purpose of the system will aid in defining the read/write (I/O) characteristics of the database software system, thus forming a picture of the system. For instance, OLTP databases typically support short transactions, so it is important that write latency is minimized. The answers to the questions above will also help determine the type of hardware storage configuration needed. Another important question when determining the type of hardware configuration is the following. How much downtime can be afforded, and how much failover is needed?
At this point, the types of queries can be seen that will be run on the database with heavy consideration given to the I/O characteristics of the database and the types of joins and scans preferred for the query optimizer to ideally use.
Agile data modeling can be defined as continual iterative feedback with the ability to add entities and relationships, or slice off sections of a data model in various sizes, at any point in the project. This may seem difficult but can be accomplished. Whether the database architecture adheres to the relational model, object-relational model, or object model, the proper classification of entities and attributes, in conjunction with a balance of normalized and de-normalized groups of data, will allow the addition and subtraction of small and large chunks of relations from the data model throughout the development process; for a specific domain. Formal algorithms can be used to map entity relationship models to relations.
And how is this done? Continually modeling entities within the real world. Abstracting classes, re-factoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one size fits all approach.
It's important to note that data modeling is a crucial step in designing a database software system that efficiently stores and retrieves information. This process involves defining the I/O characteristics of the database, understanding the purpose of the system, and creating a conceptual data model for the domain. From there, a logical data model can be created, followed by a physical data model that considers the hardware storage configuration needed. Agile data modeling allows for continual iterative feedback, and the proper classification of entities and attributes, in conjunction with a balance of normalized and de-normalized groups of data, will allow for the addition and subtraction of small and large chunks of relations from the data model throughout the development process. Overall, data modeling is a critical aspect of creating an efficient and effective database software system.
Saturday, July 31, 2010
Close to Optimal Data Modeling
Agile Data modeling, de-normalization, and other important factors
This subject probably deserves an entire book but in the essence of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note, this is not meant to be exhaustive. Relational models, object relational models, and object models are not always the right solution for the problem. There are many great solutions available in the noSQL area and there is nothing wrong with a hybrid solution.
Software applications that are compiled on specific machine architectures often store and retrieve information from database software systems. Web server platforms often interpret machine independent script that stores and retrieves information from database software systems. Java byte code is executed by a Java Virtual Machine and data is stored and retrieved from database software systems.
Efficient storage and retrieval of information from database software systems is certainly dependent on underlying hardware and operating system software characteristics. However; efficient storage and retrieval of information is also dependent on the structure and efficient execution of query statements that are used to store and retrieve the information. In a purely academic world, data is modeled and normalized in 3NF, 4NF, and 5NF forms. However; higher order normalized forms do not always promote the creation of optimal query statements. Therefore, certain parts of a data model are often de-normalized to improve the read performance of the database software system. At a high level, joins between multiple, large relations are generally expensive so de-normalization reduces this cost. Rreducing the cost of something typically results in inconsistencies or redundancies in the data so the developer(s) is left with the job of minimizing the amount of redundant data. Formal process typically alleviates this strain.
Defining the I/O characteristics of a database is an important precursor to data modeling. Often times, the I/O characteristics of the database are not defined before modeling the data. Nevertheless, there are a few basic tasks and questions that can help define the I/O characteristics of the database. What is the purpose of the database and what types of problem(s) does the database solve? What types of operations (read, write) will you be performing on the database? How many users will the database support? Define user stories - short, succinct 1-3 sentence action statements. Define as many as you need from both the developer perspective and the user perspective. The user stories will help define the conceptual data model for the domain from which the logical data model can be created. A clear definition of the purpose of the system will aid in defining the read/write (I/O) characteristics of the database software system. Thus forming a picture of the system. For instance, OLTP databases typically support short transactions so it is important that write latency is minimized. The answers to the questions above will also help determine the type of hardware storage configuration needed. Another important question when determining the type of hardware configuration is the following. How much downtime can we afford and how much failover do we need?
At this point, we can begin to see the types of queries that will be run on the database with heavy consideration given to the I/O characteristics of the database and the types of joins and scans that we would like for the query optimizer to ideally use.
Agile data modeling - Continual, iterative feedback. The ability to add entities and relationships, or slice off sections of a data model in various sizes, at any point in the project. This may seem difficult but can be accomplished. Whether the database architecture adheres to the relational model, object-relational model, or object model, the proper classification of entities and attributes, in conjunction with a balance of normalized and de-normalized groups of data, will allow the addition and subtraction of small and large chunks of relations from the data model throughout the development process; for a specific domain.
Formal algorithms can be used to map entity relationship models to relations.
And how is this done? Continually modeling entities within the real world. Abstracting classes, re-factoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one size fits all approach.
This subject probably deserves an entire book but in the essence of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note, this is not meant to be exhaustive. Relational models, object relational models, and object models are not always the right solution for the problem. There are many great solutions available in the noSQL area and there is nothing wrong with a hybrid solution.
Software applications that are compiled on specific machine architectures often store and retrieve information from database software systems. Web server platforms often interpret machine independent script that stores and retrieves information from database software systems. Java byte code is executed by a Java Virtual Machine and data is stored and retrieved from database software systems.
Efficient storage and retrieval of information from database software systems is certainly dependent on underlying hardware and operating system software characteristics. However; efficient storage and retrieval of information is also dependent on the structure and efficient execution of query statements that are used to store and retrieve the information. In a purely academic world, data is modeled and normalized in 3NF, 4NF, and 5NF forms. However; higher order normalized forms do not always promote the creation of optimal query statements. Therefore, certain parts of a data model are often de-normalized to improve the read performance of the database software system. At a high level, joins between multiple, large relations are generally expensive so de-normalization reduces this cost. Rreducing the cost of something typically results in inconsistencies or redundancies in the data so the developer(s) is left with the job of minimizing the amount of redundant data. Formal process typically alleviates this strain.
Defining the I/O characteristics of a database is an important precursor to data modeling. Often times, the I/O characteristics of the database are not defined before modeling the data. Nevertheless, there are a few basic tasks and questions that can help define the I/O characteristics of the database. What is the purpose of the database and what types of problem(s) does the database solve? What types of operations (read, write) will you be performing on the database? How many users will the database support? Define user stories - short, succinct 1-3 sentence action statements. Define as many as you need from both the developer perspective and the user perspective. The user stories will help define the conceptual data model for the domain from which the logical data model can be created. A clear definition of the purpose of the system will aid in defining the read/write (I/O) characteristics of the database software system. Thus forming a picture of the system. For instance, OLTP databases typically support short transactions so it is important that write latency is minimized. The answers to the questions above will also help determine the type of hardware storage configuration needed. Another important question when determining the type of hardware configuration is the following. How much downtime can we afford and how much failover do we need?
At this point, we can begin to see the types of queries that will be run on the database with heavy consideration given to the I/O characteristics of the database and the types of joins and scans that we would like for the query optimizer to ideally use.
Agile data modeling - Continual, iterative feedback. The ability to add entities and relationships, or slice off sections of a data model in various sizes, at any point in the project. This may seem difficult but can be accomplished. Whether the database architecture adheres to the relational model, object-relational model, or object model, the proper classification of entities and attributes, in conjunction with a balance of normalized and de-normalized groups of data, will allow the addition and subtraction of small and large chunks of relations from the data model throughout the development process; for a specific domain.
Formal algorithms can be used to map entity relationship models to relations.
And how is this done? Continually modeling entities within the real world. Abstracting classes, re-factoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one size fits all approach.
Saturday, July 24, 2010
High Performance Database Development
Fast, High performance, database driven, application architectures that support true rapid development are little talked about. AOLServer, OpenACS, PostgreSQL, and TCL are such a combination.
I was introduced to AOLServer and OpenACS in 2002 by a close friend who I grew up with. We built a very robust application architecture over the course of the following 6 years.
PostgreSQL was sitting behind the AOLServer/OpenACS instance and AOLServer nicely managed the database connection pools. AOLServer is multi-threaded and internally handles TCL interpretation and execution. TCL makes heavy use of lists (implementation notes aside), so many LISP programmers have enjoyed working with it.
The OpenACS developer toolbar is a tool that provides developers with access to various debugging and profiling information about their OpenACS application. It is a feature that is built into the OpenACS application framework and can be enabled or disabled depending on the needs of the developer.
Once enabled, the developer toolbar is accessible via a web browser and provides a range of information about the current page, including SQL queries, response times, and server resources. This information can be used to optimize and debug an OpenACS application, allowing developers to quickly identify and fix performance issues.
The OpenACS developer toolbar is just one of the many tools available to developers using the AOLServer, OpenACS, PostgreSQL, and TCL stack. Together, these technologies provide a fast, high-performance, database-driven application architecture that supports rapid development.
I was introduced to AOLServer and OpenACS in 2002 by a close friend who I grew up with. We built a very robust application architecture over the course of the following 6 years.
PostgreSQL was sitting behind the AOLServer/OpenACS instance and AOLServer nicely managed the database connection pools. AOLServer is multi-threaded and internally handles TCL interpretation and execution. TCL makes heavy use of lists (implementation notes aside), so many LISP programmers have enjoyed working with it.
The OpenACS developer toolbar is a tool that provides developers with access to various debugging and profiling information about their OpenACS application. It is a feature that is built into the OpenACS application framework and can be enabled or disabled depending on the needs of the developer.
Once enabled, the developer toolbar is accessible via a web browser and provides a range of information about the current page, including SQL queries, response times, and server resources. This information can be used to optimize and debug an OpenACS application, allowing developers to quickly identify and fix performance issues.
The OpenACS developer toolbar is just one of the many tools available to developers using the AOLServer, OpenACS, PostgreSQL, and TCL stack. Together, these technologies provide a fast, high-performance, database-driven application architecture that supports rapid development.
Monday, July 19, 2010
Dtrace on FreeBSD 8
Dtrace on FreeBSD 8 - new server with no load
# uname -msri
FreeBSD 8.0-RELEASE-p3 amd64 DEV_A64_KERNEL
# /usr/local/share/DTTraceToolkit/syscallbypid.d Tracing... Hit Ctrl-C to end. 77913 perl sigprocmask 48 1267 svscan stat 51 45001 httpd read 52 26063 python2.4 recvfrom 61 26922 sshd ioctl 69 27104 more read 69 26888 postgres getppid 78 26888 postgres gettimeofday 78 26888 postgres select 78 26889 postgres getppid 78 26889 postgres select 78 45001 httpd poll 81 27102 postgres read 82 27103 postgres read 82 27105 postgres read 82 27101 dtrace clock_gettime 97 26063 python2.4 select 102 45001 httpd writev 118 44966 python2.4 read 121 26890 postgres read 162 26063 python2.4 read 179 26063 python2.4 _umtx_op 186 27104 more write 205 26063 python2.4 write 208 26891 postgres write 300 27101 dtrace ioctl 301 26922 sshd write 304 26922 sshd read 306 27054 psql poll 378 27054 psql recvfrom 378 27055 postgres sendto 379 26922 sshd select 609 26922 sshd sigprocmask 1218 27054 psql write 3203
# /usr/local/share/DTTraceToolkit/rwbypid.d 78168 sshd R 39 97062 squid R 58 45164 httpd W 81 97062 squid W 95 97474 python2.4 R 98 555 pflogd R 132 45164 httpd R 186
# dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'
CPU     ID                    FUNCTION:NAME
  0  25202                       open:entry svscan .
  0  25202                       open:entry imapd ./cur
  0  25202                       open:entry imapd ./new
  0  25202                       open:entry imapd ./courierimapkeywords/:list
  0  25202                       open:entry imapd ./courierimapkeywords
  0  25202                       open:entry svscan .
  2  25202                       open:entry postgres pg_stat_tmp/pgstat.tmp
  7  25202                       open:entry postgres global/pg_database
  7  25202                       open:entry postgres pg_stat_tmp/pgstat.stat
  7  25202                       open:entry postgres pg_stat_tmp/pgstat.stat
  7  25202                       open:entry postgres pg_stat_tmp/pgstat.stat
  4  25202                       open:entry tinydns data.cdb
  5  25202                       open:entry tinydns data.cdb
  0  25202                       open:entry svscan .
  0  25202                       open:entry svscan .
  0  25202                       open:entry svscan .Syscalls count by process# dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'
 28820  qmailmrtg7                                                      158
    28825  qmailmrtg7                                                      158
    27504  postgres                                                        159
    28776  cron                                                            234
    28853  bash                                                            245
    28861  bash                                                            245
    28869  bash                                                            245
    28877  bash                                                            245
    28798  qmailmrtg7                                                      264
    28777  newsyslog                                                       293
    28772  dtrace                                                          322
    28778  sh                                                              327
    28281  httpd                                                           542
    28900  svn                                                             691
    28903  svn                                                             740
    28902  svn                                                             748
    28904  svn                                                             748
    28901  svn                                                             758
    28780  perl                                                           3023
# Files opened by process
dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'
CPU     ID                    FUNCTION:NAME
  4  25202                       open:entry svscan .
  0  25202                       open:entry squid /usr/local/plone/<domain-name>-dev.com/var/squidstorage/08/0C
  4  25202                       open:entry svscan .
  4  25202                       open:entry svscan .
  4  25202                       open:entry svscan .
  0  25202                       open:entry imapd ./.Spam/tmp/1279591336.M11620P234573_courierlock.dev.<domain-name>.com
  0  25202                       open:entry imapd ./.Spam/tmp/1279591336.M11620P234573_courierlock.dev.<domain-name>.com
  0  25202                       open:entry imapd ./.Spam/tmp
  0  25202                       open:entry imapd ./.Spam/tmp/1279591336.M11867P28573_imapuid_15.dev.<domain-name>.com
  0  25202                       open:entry imapd ./.Spam/courierimapuiddb
  0  25202                       open:entry imapd ./.Spam/tmp/1279591336.M113467P28573_imapuid_15.dev.<domain-name>.com
  0  25202                       open:entry imapd ./.Spam/cur
  0  25202                       open:entry imapd ./.Spam/new
  0  25202                       open:entry imapd ./.Spam/courierimapkeywords/:list
  0  25202                       open:entry imapd ./.Spam/courierimapkeywords
  0  25202                       open:entry squid /usr/local/plone/abcdexcrfdsf-dev.com/var/squidstorage/09/0C
  4  25202                       open:entry svscan .
Subscribe to:
Comments (Atom)