123
 123

Tip: 看不到本站引用 Flickr 的图片? 下载 Firefox Access Flickr 插件 | AD: 订阅 DBA notes --

2010-06-16 Wed

21:23 Oracle UTF8 Encoding and String Manipulation Overhead (3947 Bytes) » The Pythian Blog

For one of our customers, I’ve recently reviewed the strategy of migration from single-byte encoding to variable length multi-byte UTF8 (AL32UTF8 encoding in Oracle naming standards). These type of projects are coming up again and again so I think it must be common for many of you. Thus, this bit might be useful. I’m also interested in your experience – perhaps you can run this simple simulation on your platforms and provide the results in the comments?

Back to the project… One area was estimation of the string manipulation overhead. Based on Pythian experience as well as the feedback from my network, I could conclude that nobody has observed any measurable performance degradation or significant CPU usage growth clearly attributed to UTF8 migration.

Thus, I decided to simulate at least some operations and measure them. One of the concerns was sub-string extraction — fixed length encoding sub-string is super easy to implement as the offset and length in bytes are known. Variable length character set would require scanning the whole string from the very beginning because byte offset and byte length are not known until the string is traversed from the beginning character by character.

This is the PL/SQL block I came up with:

declare
  s1 varchar2(4000 char);
  s2 varchar2(4000 char);
  i INTEGER;
begin
  s1 := rpad('a',3999,'a');
  for i in 1..10000000 loop
    null;
    --s2 := 'a' || 'b' || 'c' || 'd';
    --s2 := substr(s1,3000,1) || substr(s1,3001,1) || substr(s1,3002,1) || substr(s1,3003,1);
  end loop;
  dbms_output.put_line(s2);
end;

There are 3 cases and only one of them needs to be un-commented when running the block. null; is there to just capture the baseline and measure time required to run an empty loop. Concatenation s2 := 'a' || 'b' || 'c' || 'd'; is one scenario and, finally, substring extraction s2 := substr(s1,3000,1) || ... is the second one.

I’ve run it in the VMs on my laptop as well as customer’s own physical servers to confirm the timing. Below is the result from my virtual machines but the percentage was pretty much the same. Platform is Linux. Oracle version 10.2 and 11.1. Note that 11.2 is smarter and optimizes the block with NULL loop so you would need to trick PL/SQL optimizer better.

Character set null concat substr
WE8ISO8859P1 0.5 1.8 8.0
AL32UTF8 0.5 2.2 9.9

The time in the table is seconds that PL/SQL block was executing. I ran if few times and averaged the result but it was very consistent anyway. Also, skip the very first measurement to avoid the impact of block parsing.

To interpret the results, the string manipulation/concatenation (excluding 0.5s of looping itself) accounts 1.3s vs 1.7s – 30% UTF8 overhead. SUBSTR function overhead (excluding loop and concat itself) accounts for 6.2s vs 7.7s which is only 25% overhead.

I found substring overhead rather small – I expected order of magnitude difference to be honest. However, 30% of concatenation overhead seemed to be little too much and I don’t see why it should be that high.

Now, while overhead is rather noticeable, I didn’t see databases that were doing mostly string manipulations in SQL and PL/SQL. I think that’s why in the big picture, we generally don’t see much performance impact moving to variable-length character set.

If you have access to the databases with fixed-length and variable-length character sets on the similar hardware, feel free to post your results below.

17:25 Redis API Exposed as Web Services (1005 Bytes) » myNoSQL
Redis API Exposed as Web Services:

Redis API exposed as web services using ☞ Service Stack an open source .NET and Mono web services framework. Might give you an idea how to build your own Redis-as-a-Service.

09:35 Is NoSQL Just for a Small Niche? (2425 Bytes) » myNoSQL
Is NoSQL Just for a Small Niche?:

NoSQL is not a bad idea, but it fills a pretty small niche. A much smaller niche than I thought before. If you are considering a NoSQL implementation, you should probably satisfy several of the following conditions:

  • willing to develop in house expertise in NoSQL storage, monitoring, backups, analysis, tuning
  • large dataset
  • a lot of unstructured data
  • no schema design

Leaving aside the fact that “niche” can be a subjective term in this context (i.e. is 20% of the companies handling 80% of the data a niche?), these may be valid concerns raised by someone looking at NoSQL.

Every tech shop looking to adopt a new technology that helps them solve a real problem will have to develop some form of in house expertise. Indeed the availability of tools, support and external resources is important.

These days when the quantity of digital data is growing exponentially, trying to control the format of the data seems like an attempt to swim against the river. That’s not to say that structure doesn’t matter, but only that instead of spending time putting order into chaos, we would be better off solving our problems at hand.

Concluding, NoSQL is definitely not a silver bullet and is not here to replace any of the existing technologies and while concerns about new technologies have always existed, it is always useful to have the right tools around.

08:01 Shuck & Awe #6: Hunting for Perl (4174 Bytes) » The Pythian Blog
[yanick@enkidu shuck]$ perl -MFile::Find::Rule \
    -e'INIT{@ARGV=File::Find::Rule->file->name("*.news")->in("blogs")}'

Remember me mentioning David Wheeler’s CPAN-like project for PostgreSQL? Well, by now it has an official name — PGXN — and the ball has now been set into motion. This is going to be good.

bingos decided to take the Dist::Zilla leap this week. A few plugins have already been churned out as the result.

Danger Will Robinson! If you are using File::chdir, David Golden warns that Perl 5.13.1 broke it by fixing a tied variable-related bug. Things are expected to be back to normal with Perl 5.13.2.

In a glorious display of shininess, garu shares with us Tweetylicious, a microblogging Twitter-lookalike written using Mojolicious::Lite. He even organized the commits for the project in a tutorialish way, making it the perfect introduction to Mojolicious.

Daisuke Maki brings some mind-boggling news from Japan. A rake and ne’er-do-well seems to have applied for the copyright on the word ‘Perl’ in Japan… and the request has been accepted. Japanese mongers are not amused.

oylenshpeegul came up with the pattern for a knitted Perl’s camel dishcloth (which he wrote using POD, natch). Good timing too, considering that we are in the midst of World Wide Knit in Public Day.

Ashley Pond V delights us with a clever mix of Perl maze generator and CSS wizardry.

Dave Jacoby shares his new-found love for IO::Interactive.

fREW announces the arrival of DBIx::Class::DeploymentHandler, which is touted as being even more awesome than DBIx::Class::Schema::Versioned. It slices, it dices, it does upgrades and downgrades!

ap0calypse begs you to do your part to stop the all-to-general terrible abuse of print statements and use heredocs whenever justified.

Adam Kennedy reports that the top 100 CPAN modules (in term of dependencies) suddenly gained some weight. We are still unsure of the specific cause. Personally, I blame it on the codenaming of the Windows port of Perl after delicious flavors of icecream.

Inspired by perlbrew, that nectar from the gods, bingos came out with smokebrew, which follows the same general idea, but is geared toward the creation of Perl testing environments.

[yanick@enkidu shuck]$ perl -E'sleep 2 * 7 * 24 * 60 * 60 # see y'all in 2 weeks!'
07:43 Dist::Zilla autocomplete (2438 Bytes) » The Pythian Blog

Does anyone know of a Yak Shaving Anonymous association hackers addicted to shearing Tibetan bovines could join?

Anyway, here are two little things I hacked on top of Dist::Zilla that peeps might find useful.

The first, as hinted by the blog entry’s title, is a direct adaptation of Aristotle’s perldoc-complete for dzil.

$ dzil <tab>
build     install   new       plugins   rjbsver   smoke     xtest
clean     listdeps  nop       release   run       test

The second is actually the one that started that round of shaving for me. As there is about a gazillion Dist::Zilla plugins, I wanted to have a quick way to see all the plugins installed on a specific machine. Enter a new dzil sub-command: plugins.

$ dzil plugins
[ lotsa plugins ]
MatchManifest - Ensure that MANIFEST is correct
MetaConfig - summarize Dist::Zilla configuration into distmeta
MetaJSON - produce a META.json
* MetaNoIndex - Stop CPAN from indexing stuff
MetaProvides - Generating and Populating 'provides' in your META.yml
MetaProvides::Class - Scans Dist::Zilla's .pm files and tries to identify classes using Class::Discover.
MetaProvides::FromFile - In the event nothing else works, pull in hand-crafted metadata from a specified file.
MetaProvides::Package - Extract namespaces/version from traditional packages for provides
* MetaResources - provide arbitrary "resources" for distribution metadata
MetaTests - common extra tests for META.yml
MetaYAML - produce a META.yml
[ still lotsa plugins ]

The plugins marked with an asterix are used by the current dist.ini. Also, give a plugin name to the sub-command, and it’ll act as perldoc Dist::Zilla::Plugin::<Plugin Name>. The sweet thing is, autocomplete also works there:

$ dzil plugins Pod<Tab>
PodCoverageTests  PodSyntaxTests    PodVersion        PodWeaver

Both patches are available on my Github fork of Dist::Zilla. Enjoy!

01:59 Integrating Hive and HBase at Facebook (1890 Bytes) » myNoSQL
Integrating Hive and HBase at Facebook:

While definitely interesting, something doesn’t seem to add up:

It (nb HBase) sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.

What I seem to not understand is:

So why HBase?

01:15 CouchDB Case Study: Web Based IRC (1696 Bytes) » myNoSQL
CouchDB Case Study: Web Based IRC:

Another CouchDB case study this time from Anologue:

Initial goal: enable any number of people to view a web page that would serve as a sort of chat-room. Generate a link, share it with whomever you’d like to participate in the dialogue, type your name and text to add to the conversation.

I’d speculate that CouchDB was used due to its possibly simplified architecture of the web app and its document-based data model. Definitely not based on some “fake” or just plain wrong reasons.

Adding it to the list of CouchDB case studies.

00:56 When should I use MongoDB? (2769 Bytes) » myNoSQL
When should I use MongoDB?:

Leaving aside a couple of small warnings, Brandon Keepers’ answer to the question when should I use MongoDB is:

Always.

No, seriously!?

OK, I think MongoDB makes sense with most web applications. In the end, most apps are just doing glorified CRUD, and don’t need ACID or many of the features of a relational database. There are times when you definitely should not use MongoDB, like when you need transactions.

NO, NO, NO. If you do that, then you’re definitely doing it wrong! Again!

Take a step back and think about this hypothetical situation: you’ve been driving the same car for the last 30 years. But recently (you’ve come out of your cave and) you discover there are other new shiny cars around, some looking and feeling and offering quite a different feature set than your old beloved car, others just being an upgraded version of your old cosy car.

Question is: would you go to the first car dealer and just get one of these “upgraded” cars? Or would you take some time to look at all these shiny new toys and see how each of them would improve your driving experience and safeness?

Ignoring for a moment my interest in shiny new things, I think we should always keep eyes open, evaluate our options and try to reach conclusions based on what we need. Remaining always inside our comfort zone will not help us experience the new, the better, the different.

So, what is my answer to the question “when should I use MongoDB”: only if it makes sense. Or expanding on this a bit, I’d say you should take your time to analyze your options and think about what you need firstly. Think about your data and your app data access patterns. Then use MongoDB only if is the best tool for your current job.

2010-06-15 Tue

18:00 Presentation: OrientDB introduction (1548 Bytes) » myNoSQL

Not long away I’ve written about a couple of new projects in the NoSQL space among them listing OrientDB.

The slides below are offering a quick overview of OrientDB:

  • what is OrientDB
  • database structure
  • API
  • clustering and partitioning
17:59 What does Handler_read_rnd mean? (4600 Bytes) » MySQL Performance Blog

MySQL’s SHOW STATUS command has two counters that are often confusing and result in “what does that mean?” questions:

  1. Handler_read_rnd
  2. Handler_read_rnd_next

As I understand it, there is some historical context to the choice of names here, hearkening back to before I was involved with MySQL, way back when it was a wrapper around ISAM tables — or even earlier, I don’t know. (Unireg, anyone?) In any case, these two variables deserve a little explanation.

Both counters indicate the number of times the corresponding storage engine API function has been called. In olden times, the storage engine API was called the handler API, which is why the variables begin with Handler_.

Handler_read_rnd counts the number of times the handler::rnd_pos() method is called. This method fetches a row from a table based on a “fixed position,” i.e. a random-read. What this actually means varies between storage engines. For MyISAM, position really means a byte offset from the beginning of the file. For InnoDB, it means to read a row based on a primary key value.

Handler_read_rnd_next is incremented when handler::rnd_next() is called. This is basically a cursor operation: read the “next” row in the table. The operation advances the cursor position so the next time it’s called, you get the next row.

The naming convention probably doesn’t make as much sense today as it did historically. Monty told me that “rnd” meant “random,” but now I can’t find a reference to that conversation. In any case, one of the operations is basically a random read (if you think of tuples as having a natural order, which historically they did), and the other is just a read-next operation; there’s nothing random about it, really. I would welcome any corrections from those who know more about this than I do.

Why would either of these be called? Usually Handler_read_rnd is called when a sort operation gathers a list of tuples and their “position” values, sorts the tuples by some criterion, and then traverses the sorted list, using the position to fetch each one. This is quite likely to result in retrieving rows from random points in the table, although that might not actually result in random IO if the data is all in memory. Handler_read_rnd_next is usually called for each row of a full or partial table scan.

PS: I’m being a bit vague about the sort-and-fetch algorithm; there are really two algorithms, and they are smarter than I’m giving them credit for. But that’s a topic for another blog post.


Entry posted by Baron Schwartz | 4 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

13:20 APEX — Bulk Images Upload Using EPG » The Pythian Blog

2010-06-14 Mon