Saturday, July 31, 2010
Close to Optimal Data Modeling
Agile data modeling, denormalization, and other important factors
This subject probably deserves an entire book, but in the interest of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note, this is not meant to be exhaustive. Relational models, object-relational models, and object models are not always the right solution for the problem. There are many great solutions available in the NoSQL area, and there is nothing wrong with a hybrid solution.
Software applications that are compiled for specific machine architectures often store and retrieve information from database software systems. Web server platforms often interpret machine-independent scripts that store and retrieve information from database software systems. Java bytecode is executed by a Java Virtual Machine, and data is stored and retrieved from database software systems.
Efficient storage and retrieval of information from database software systems is certainly dependent on underlying hardware and operating system characteristics. However, efficient storage and retrieval also depend on the structure and efficient execution of the query statements used to store and retrieve the information. In a purely academic world, data is modeled and normalized into 3NF, 4NF, and 5NF forms. However, higher-order normal forms do not always promote the creation of optimal query statements. Therefore, certain parts of a data model are often denormalized to improve the read performance of the database software system.
At a high level, joins between multiple large relations are generally expensive, so denormalization reduces this cost. Reducing the cost of something typically results in inconsistencies or redundancies in the data, so the developers are left with the job of minimizing the amount of redundant data. A formal process typically alleviates this strain.
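To make this trade-off concrete, here is a minimal sketch in SQL, assuming a hypothetical customers/orders schema (the table and column names are illustrative, not from any real system). The normalized design reads customer attributes through a join; the denormalized reporting table copies the customer name into each order row to keep the read path join-free, at the cost of redundant data that must be kept consistent:

-- Normalized: each fact is stored exactly once.
CREATE TABLE customers (
    customer_id   integer PRIMARY KEY,
    customer_name text NOT NULL
);

CREATE TABLE orders (
    order_id    integer PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers (customer_id),
    order_total numeric(10,2) NOT NULL
);

-- Reads pay for a join between the two relations.
SELECT o.order_id, c.customer_name, o.order_total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- Denormalized: customer_name is duplicated into each order row,
-- so reads avoid the join but updates must touch every copy.
CREATE TABLE orders_report (
    order_id      integer PRIMARY KEY,
    customer_id   integer NOT NULL,
    customer_name text NOT NULL,
    order_total   numeric(10,2) NOT NULL
);

SELECT order_id, customer_name, order_total FROM orders_report;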
Defining the I/O characteristics of a database is an important precursor to data modeling. Oftentimes, the I/O characteristics of the database are not defined before the data is modeled. Nevertheless, there are a few basic tasks and questions that can help define them. What is the purpose of the database, and what types of problems does it solve? What types of operations (read, write) will be performed on the database? How many users will the database support? Define user stories: short, succinct one-to-three-sentence action statements. Define as many as you need, from both the developer perspective and the user perspective. The user stories will help define the conceptual data model for the domain, from which the logical data model can be created. A clear definition of the purpose of the system will aid in defining the read/write (I/O) characteristics of the database software system, thus forming a picture of the system. For instance, OLTP databases typically support short transactions, so it is important that write latency is minimized. The answers to the questions above will also help determine the type of hardware storage configuration needed. Another important question when determining the hardware configuration is this: how much downtime can we afford, and how much failover do we need?
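As a minimal illustration of why OLTP write latency matters, consider the kind of short transaction such a system must commit quickly. This sketch reuses the hypothetical orders table from above and adds an equally hypothetical account_balances table:

-- A short OLTP transaction: one insert, one update, then commit.
-- The faster this round-trip completes, the lower the write
-- latency observed by the user.
CREATE TABLE account_balances (
    customer_id integer PRIMARY KEY,
    balance     numeric(10,2) NOT NULL
);

BEGIN;
INSERT INTO orders (order_id, customer_id, order_total)
VALUES (1001, 42, 19.99);
UPDATE account_balances
SET balance = balance - 19.99
WHERE customer_id = 42;
COMMIT;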
At this point, we can begin to see the types of queries that will be run on the database, with heavy consideration given to the I/O characteristics of the database and the types of joins and scans we would ideally like the query optimizer to use.
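In PostgreSQL, for instance, EXPLAIN (or EXPLAIN ANALYZE) shows which join strategies and scan types the optimizer actually chose, so those choices can be checked against expectations. A small sketch against the hypothetical schema above:

EXPLAIN ANALYZE
SELECT o.order_id, c.customer_name, o.order_total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_total > 100;

-- The resulting plan shows, for example, whether the optimizer
-- chose a hash join, merge join, or nested loop, and whether each
-- relation is read with a sequential scan or an index scan.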
Agile data modeling means continual, iterative feedback: the ability to add entities and relationships, or slice off sections of a data model in various sizes, at any point in the project. This may seem difficult, but it can be accomplished. Whether the database architecture adheres to the relational model, the object-relational model, or the object model, the proper classification of entities and attributes, in conjunction with a balance of normalized and denormalized groups of data, will allow the addition and subtraction of small and large chunks of relations from the data model, for a specific domain, throughout the development process.
Formal algorithms can be used to map entity-relationship models to relations.
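As a minimal sketch of such a mapping, assume a hypothetical ER model in which one department employs many employees. The standard construction turns each entity set into a relation and implements the one-to-many relationship as a foreign key on the many side:

-- Entity set Department becomes a relation keyed by its identifier.
CREATE TABLE departments (
    department_id   integer PRIMARY KEY,
    department_name text NOT NULL
);

-- Entity set Employee becomes a relation; the employs relationship
-- is realized as a foreign key on the many side.
CREATE TABLE employees (
    employee_id   integer PRIMARY KEY,
    employee_name text NOT NULL,
    department_id integer NOT NULL REFERENCES departments (department_id)
);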
And how is this done? By continually modeling entities in the real world: abstracting classes, refactoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one-size-fits-all approach.
Saturday, July 24, 2010
High Performance Database Development
Fast, high-performance, database-driven application architectures that support true rapid development are little talked about. AOLServer, OpenACS, PostgreSQL, and TCL are one such combination.
I was introduced to AOLServer and OpenACS in 2002 by a close friend I grew up with. We built a very robust application architecture over the course of the following six years.
PostgreSQL sat behind the AOLServer/OpenACS instance, and AOLServer nicely managed the database connection pools. AOLServer is multi-threaded and handles TCL interpretation and execution internally. TCL makes heavy use of lists (implementation notes aside), so many LISP programmers have enjoyed working with it.
The OpenACS developer toolbar gives developers access to debugging and profiling information about their OpenACS application. It is built into the OpenACS framework and can be enabled or disabled depending on the needs of the developer.
Once enabled, the developer toolbar is accessible via a web browser and provides a range of information about the current page, including SQL queries, response times, and server resources. This information can be used to optimize and debug an OpenACS application, allowing developers to quickly identify and fix performance issues.
The OpenACS developer toolbar is just one of the many tools available to developers using the AOLServer, OpenACS, PostgreSQL, and TCL stack. Together, these technologies provide a fast, high-performance, database-driven application architecture that supports rapid development.
Monday, July 19, 2010
DTrace on FreeBSD 8
DTrace on FreeBSD 8 - a new server with no load
# uname -msri
FreeBSD 8.0-RELEASE-p3 amd64 DEV_A64_KERNEL
# /usr/local/share/DTraceToolkit/syscallbypid.d
Tracing... Hit Ctrl-C to end.
 77913 perl         sigprocmask      48
  1267 svscan       stat             51
 45001 httpd        read             52
 26063 python2.4    recvfrom         61
 26922 sshd         ioctl            69
 27104 more         read             69
 26888 postgres     getppid          78
 26888 postgres     gettimeofday     78
 26888 postgres     select           78
 26889 postgres     getppid          78
 26889 postgres     select           78
 45001 httpd        poll             81
 27102 postgres     read             82
 27103 postgres     read             82
 27105 postgres     read             82
 27101 dtrace       clock_gettime    97
 26063 python2.4    select          102
 45001 httpd        writev          118
 44966 python2.4    read            121
 26890 postgres     read            162
 26063 python2.4    read            179
 26063 python2.4    _umtx_op        186
 27104 more         write           205
 26063 python2.4    write           208
 26891 postgres     write           300
 27101 dtrace       ioctl           301
 26922 sshd         write           304
 26922 sshd         read            306
 27054 psql         poll            378
 27054 psql         recvfrom        378
 27055 postgres     sendto          379
 26922 sshd         select          609
 26922 sshd         sigprocmask    1218
 27054 psql         write          3203
# /usr/local/share/DTraceToolkit/rwbypid.d
 78168 sshd         R    39
 97062 squid        R    58
 45164 httpd        W    81
 97062 squid        W    95
 97474 python2.4    R    98
   555 pflogd       R   132
 45164 httpd        R   186
# dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'
CPU     ID    FUNCTION:NAME
  0  25202    open:entry svscan .
  0  25202    open:entry imapd ./cur
  0  25202    open:entry imapd ./new
  0  25202    open:entry imapd ./courierimapkeywords/:list
  0  25202    open:entry imapd ./courierimapkeywords
  0  25202    open:entry svscan .
  2  25202    open:entry postgres pg_stat_tmp/pgstat.tmp
  7  25202    open:entry postgres global/pg_database
  7  25202    open:entry postgres pg_stat_tmp/pgstat.stat
  7  25202    open:entry postgres pg_stat_tmp/pgstat.stat
  7  25202    open:entry postgres pg_stat_tmp/pgstat.stat
  4  25202    open:entry tinydns data.cdb
  5  25202    open:entry tinydns data.cdb
  0  25202    open:entry svscan .
  0  25202    open:entry svscan .
  0  25202    open:entry svscan .
Syscalls count by process
# dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'
 28820 qmailmrtg7    158
 28825 qmailmrtg7    158
 27504 postgres      159
 28776 cron          234
 28853 bash          245
 28861 bash          245
 28869 bash          245
 28877 bash          245
 28798 qmailmrtg7    264
 28777 newsyslog     293
 28772 dtrace        322
 28778 sh            327
 28281 httpd         542
 28900 svn           691
 28903 svn           740
 28902 svn           748
 28904 svn           748
 28901 svn           758
 28780 perl         3023
Files opened by process
# dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'
CPU     ID    FUNCTION:NAME
  4  25202    open:entry svscan .
  0  25202    open:entry squid /usr/local/plone/<domain-name>-dev.com/var/squidstorage/08/0C
  4  25202    open:entry svscan .
  4  25202    open:entry svscan .
  4  25202    open:entry svscan .
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M11620P234573_courierlock.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M11620P234573_courierlock.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/tmp
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M11867P28573_imapuid_15.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/courierimapuiddb
  0  25202    open:entry imapd ./.Spam/tmp/1279591336.M113467P28573_imapuid_15.dev.<domain-name>.com
  0  25202    open:entry imapd ./.Spam/cur
  0  25202    open:entry imapd ./.Spam/new
  0  25202    open:entry imapd ./.Spam/courierimapkeywords/:list
  0  25202    open:entry imapd ./.Spam/courierimapkeywords
  0  25202    open:entry squid /usr/local/plone/abcdexcrfdsf-dev.com/var/squidstorage/09/0C
  4  25202    open:entry svscan .