2010-06-15 Tue
With the recent 0.11.0 release, Riak switched its default storage backend from embedded Innostore to Bitcask.
Andy Gross and johne had a very interesting conversation about the differences between the Innostore and Bitcask backend stores for Riak:
innostore currently creates a file per bucket/partition combo but all other backends use one file per partition
unless you really want innostore, we recommend you use bitcask
one other thing with buckets: buckets dont consume any resources as long as they use the bucket defaults - either the stock riak defaults or ones you set in your app.config
buckets that change some of those defaults take up a small amount of space in the ring data structure that’s gossiped around
(Note: The review was done as part of our consulting practice, but is totally independent and fully reflects our opinion)
In my talk at the MySQL Conference and Expo 2010, “An Overview of Flash Storage for Databases”, I mentioned that other players would most likely be coming soon. I was not actually aware of any real names at that time; it was just a guess, as the PCI-E market is really attractive, so FusionIO can’t stay alone for long. So I am not surprised to see a new card from Virident, and I was lucky enough to test a pre-production sample of the Virident tachIOn 400GB SLC card.
I think it is fair to say that Virident targets the market where FusionIO currently has a monopoly, and that it will finally bring some competition, which I believe is good for end users. I am looking forward to price competition (not having real numbers, I can only guess that vendors still put a high margin in the price), as well as competition in performance in general, in stable performance under high load in particular, and in capacity and data reliability.
The price list for Virident tachIOn cards already shows the price competition: the indicative price for the tachIOn 400GB is 13,600$ (that is 34$/GB), and the entry-level card is 200GB at 6,800$ (there is also a 300GB card in the product line). The price for the FusionIO 160GB SLC (from dell.com, price on 14-Jun-2010) is 6,308.99$ (that is about 39.4$/GB).
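The per-gigabyte figures are simple divisions of list price by nominal capacity. A minimal sketch that reproduces them from the prices quoted above (the function and dictionary names are only illustrative):

```python
# Prices (USD) and nominal capacities (GB) as quoted in the paragraph above.
cards = {
    "Virident tachIOn 400GB": (13600.00, 400),
    "Virident tachIOn 200GB": (6800.00, 200),
    "FusionIO 160GB SLC": (6308.99, 160),
}


def price_per_gb(price_usd, capacity_gb):
    """Return the list price divided by the nominal capacity."""
    return price_usd / capacity_gb


for name, (price, capacity) in cards.items():
    print(f"{name}: {price_per_gb(price, capacity):.2f} $/GB")
```

This compares nominal capacities only; any space reserved for write performance changes the picture.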
A couple of words about the product: I know that the Virident engineering team concentrated on getting stable write performance during long-running write activity and in cases when space utilization is close to 100%. As you may know (check my presentation), SSD design requires background “garbage collector” activity, which needs space to operate, and the Virident card already has enough reserved space to deliver stable write performance even when the disk is almost full.
As for reliability, I think the design of the card is quite neat. The card itself contains a bunch of replaceable flash modules, and each individual module can be replaced in case of failure. Internally, the modules are also joined in a RAID (which is fully transparent to the end user).
All this provides a good level of confidence in data reliability: if a single module fails, the internal RAID allows operations to continue, and after the module is replaced the array is rebuilt. This still leaves the controller on the card as a single point of failure, but in that case all flash modules can be safely moved to a new card with a working controller. (Note: It was not tested by Percona engineers, but taken from vendor’s specification)
As for power failures, the flash modules also come with capacitors, which guarantee that data is delivered to the final media even if power is lost on the main host. (Note: It was not tested by Percona engineers, but taken from vendor’s specification)
Now to the most interesting part: performance numbers. I took the sysbench fileio benchmark with a 16KB block size to see what maximal performance we can expect.
Server specification is:
- Supermicro X8DTH series motherboard
- 2 x Xeon E5520 (2.27GHz) processors w/HT enabled (16 logical cores)
- 64GB of ECC/Registered DDR3 DRAM
- CentOS 5.3, 2.6.18-164 kernel
- Filesystem is XFS, formatted with the mkfs.xfs -s size=4096 option (size=4096, the sector size, is very important for getting aligned IO requests) and mounted with the nobarrier option
- Benchmark: sysbench fileio on 100GB file, 16KB blocksize
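For readers who want to try a similar run, here is a rough sketch that only builds sysbench 0.4-style fileio command lines and prints them. The 100GB file size and 16KB block size come from the list above; the run length, the thread ladder and the helper name are assumptions, and the exact options used for this test are not given in the post.

```python
def sysbench_fileio_cmd(mode, threads, action="run",
                        total_size="100G", block_size=16384, max_time=180):
    """Build a sysbench 0.4-style fileio command line as an argument list."""
    return [
        "sysbench",
        "--test=fileio",
        f"--file-total-size={total_size}",
        f"--file-block-size={block_size}",
        f"--file-test-mode={mode}",
        f"--num-threads={threads}",
        f"--max-time={max_time}",  # assumed run length in seconds, not from the post
        "--max-requests=0",        # no request cap, stop on max-time only
        action,
    ]


# The test files have to be created once before any run.
print(" ".join(sysbench_fileio_cmd("rndwr", 1, action="prepare")))

# Random reads, random writes and sequential writes over an assumed thread ladder.
for mode in ("rndrd", "rndwr", "seqwr"):
    for threads in (1, 4, 8, 16, 32, 64):
        print(" ".join(sysbench_fileio_cmd(mode, threads)))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; the lines can be pasted into a shell on the test box.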
The raw results are available on Wiki
And here are the graphs for random reads, random writes and sequential writes:
I think it is very interesting to see the distribution of the 95% response time results (a time of 0 is obviously a problem in sysbench, which does not have enough time resolution for such very fast operations).
As you can see, we can get about 400MB/sec of random write bandwidth with 8-16 threads, with a response time of 3.1ms (8 threads) and 3.8ms (16 threads) in 95% of cases.
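To put the bandwidth figure in terms of requests, it can be converted to operations per second using the 16KB block size of the benchmark. A small sketch, assuming 1MB = 1024KB (the helper name is illustrative):

```python
BLOCK_SIZE_KB = 16  # block size used in the benchmark


def iops_from_bandwidth(mb_per_sec, block_kb=BLOCK_SIZE_KB):
    """Requests per second needed to move mb_per_sec using block_kb-sized requests."""
    return mb_per_sec * 1024 / block_kb


print(iops_from_bandwidth(400))  # 25600.0
```

So about 400MB/sec of 16KB random writes corresponds to roughly 25,600 write requests per second.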
As for issues, I should mention that despite the good response time results, the maximal response time can in some cases jump to 300 ms per request. I was told this corresponds to garbage collector activity and will be fixed in the production release of the driver.
I think it is fair to make a comparison with a FusionIO card, especially under write pressure. As you may know, FusionIO recommends reserving space to get sustainable write performance (Tuning Techniques for Writes).
I took a FusionIO ioDrive 160GB SLC card and tested it fully formatted (file size 145GB) and formatted with 25% space reservation (file size 110GB), against the Virident card with a 390GB file size. This also allows us to see whether the Virident tachIOn card can sustain writes when it is fully utilized.
As a disclaimer, I want to mention that the Virident tachIOn card was fine-tuned by Virident engineers, while the FusionIO card was tuned only by me, and I may not have all the knowledge needed for FusionIO tuning.
The first graph is random reads, to compare read performance:
As you can see, with 1 and 4 threads FusionIO is better, while with more threads the Virident card scales better.
And now random writes:
You can see that FusionIO definitely needs space reservation to provide high write bandwidth, and it comes at a cost (25% space reservation -> roughly a one-third increase in $/GB, since only 75% of the capacity stays usable).
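As a rough check on what reservation does to the price per usable gigabyte, the sketch below takes the two file sizes used in this test (145GB and 110GB) as a stand-in for usable capacity and applies the dell.com price quoted earlier. Treating file size as usable capacity is an assumption made only for illustration.

```python
FUSIONIO_PRICE_USD = 6308.99  # dell.com price quoted earlier in the post


def cost_per_usable_gb(price_usd, usable_gb):
    """Price divided by the capacity that is actually available for data."""
    return price_usd / usable_gb


def reservation_overhead(reserved_fraction):
    """Relative $/GB increase when reserved_fraction of the capacity is set aside."""
    return 1.0 / (1.0 - reserved_fraction) - 1.0


full = cost_per_usable_gb(FUSIONIO_PRICE_USD, 145)
reserved = cost_per_usable_gb(FUSIONIO_PRICE_USD, 110)

print(f"fully formatted:  {full:.2f} $/GB")
print(f"with reservation: {reserved:.2f} $/GB")
print(f"increase from file sizes: {reserved / full - 1:.0%}")
print(f"increase for a 25% reservation: {reservation_overhead(0.25):.0%}")
```

Either way of counting puts the extra cost per usable gigabyte at around a third.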
In conclusion I can highlight:
- I am impressed with the architecture design with replaceable individual flash modules; I think it establishes a new high-end standard for flash devices
- With a single card you can get over 1GB/sec bandwidth in random reads (16-64 working threads), which is the highest result I’ve seen so far (again, for a single card)
- Random write bandwidth exceeds 400MB/sec (8-16 working threads)
- Random read/write mix results are also impressive, which can be quite important in workloads like FlashCache, where the card is under concurrent read and write pressure
- Quite stable sequential write performance (important for log-related activity in MySQL)
I am looking forward to presenting results for sysbench oltp and tpcc workloads, and also in FlashCache mode.
Entry posted by Vadim | No comment
2010-06-14 Mon
I spent some time last month getting up to speed on MySQL. One of the nice perks of working at Pythian is the ability to study during the workday. They could have easily said “You are an Oracle DBA, you don’t need to know MySQL. We have enough REAL MySQL experts”, but they didn’t, and I appreciate it.
So how does an Oracle DBA go about learning MySQL?
Obviously you start by reading the docs. Specifically, I looked for the MySQL equivalent of the famous Oracle “Concepts Guide”.
Unfortunately, it doesn’t exist. I couldn’t find any similar overview of the architecture and the ideas behind the database. The first chapter of “High Performance MySQL” had a high level architecture review, which was useful but being just one chapter in a book, it lacked many of the details I wanted to learn. Peter Zaitsev’s “InnoDB Architecture” presentation had the kind of information I needed – but covered just InnoDB.
That’s really too bad, because I definitely feel the lack – while I can easily tell you what Oracle does when you connect to a database, run a select, an update, commit or rollback, I can’t say the same about MySQL. So far I have managed without this knowledge, but I have a constant worry that this will come back and bite me later.
Lacking a concepts guide, I read the documentation I had access to: Sheeri has nice presentations available for Pythian employees (and probably customers too. I’m not sure if she ever released them to the whole world). The official documentation is not bad either – it covers syntax without obvious errors and serves as a decent “how do I do X?” guide.
But reading docs is only half the battle. The easier half, too. So I installed MySQL 5.1 on my Ubuntu from ready packages. Then I installed MySQL 5.5 from the tarball – which was not nearly as much fun, but by the time this worked I knew much more about where everything is located and the various ways one can mis-configure MySQL.
Once the installation was successful, I played a bit with users, schemas and databases. MySQL is weird – schemas are called databases, and users have a many-to-many relation with databases. If a user logs in from a different IP, it is almost like a different user. If you delete all the data files and restart MySQL, it will create new empty data files instead. You can easily start a new MySQL server on the same physical box by modifying one file and creating a few directories.
MySQL docs make a very big deal about storage engines. There are only 2 things that are important to remember though: MyISAM is non-transactional and is used for the mysql schema (the data dictionary); it doesn’t have foreign keys or row-level locks. InnoDB is transactional, has row-level locks and is used everywhere else.
There is a confusing bunch of tools for backing up MySQL. mysqldump is the MySQL equivalent of Export, except that it creates a file full of the SQL commands required to recreate the database. These files can grow huge very fast, but it is very easy to restore from them, to restore any part of the schema, or even to modify the data or schema before restoring.
XtraBackup is a tool for consistent backups of InnoDB schemas (remember that in MyISAM there are no transactions, so consistent backups are rather meaningless). It is easy to use – one command to back up, two commands to restore. You can do PITR of sorts with it, and you can restore specific data files. It doesn’t try to manage the backup policies for you the way RMAN does – so cleaning old backups is your responsibility.
Replication is considered a basic skill, not an advanced skill like in the Oracle world. Indeed, once you know how to restore from a backup, setting up replication is trivial. It took me about 2 hours to configure my first replication in MySQL. I think in Oracle Streams it took me a few days, and that was on top of years of other Oracle experience.
Having access to experienced colleagues who are happy to spend time teaching a newbie is priceless. I already mentioned Sheeri’s docs. Chris Schneider volunteered around 2 hours of his time to introduce me to various important configuration parameters, InnoDB secrets and replication tips and tricks. Raj Thukral helped me by providing step-by-step installation and replication guidance and helping debug my work. I’m so happy to work with such awesome folks.
To my shock and horror, at that point I felt like I was done. I had learned almost everything important there was to know about MySQL. It took a month. As an Oracle DBA, after two years I still felt like a complete newbie, and even today there are many areas where I wish I had better expertise. I’m sure it is partially because I don’t know how much I don’t know, but MySQL really is a rather simple DB – there is less to tweak, less to configure, fewer components, fewer tools to learn.
Jonathan Lewis once said that he was lucky to learn Oracle with version 6, because back then it was still relatively simple to learn, but the concepts didn’t change much since so what he learned back then is still relevant today. Maybe in 10 years I’ll be saying the same about MySQL.
- Juan Maiz: Connecting to MongoHq with Haskell
Looks like today we’ll have some geeky stuff:
This connects, logs in, inserts and then retrieves a document in a collection.
- ksankar: A path through a NOSQLSummer Reading
In case you have trouble picking the first NOSQL summer paper, you might find this classification useful. As a side note, Bucharest was the first city to organize the event and we had quite some fun freely discussing the CAP Theorem.
- vonconrad: Which of CouchDB or MongoDB suits my needs?
Anyone willing to help him out? The presented scenario is quite interesting and definitely a good fit for document databases.
- MongoDB driver for Delphi: pebongo
I don’t think I know anyone doing Delphi programming, but why not post about it, since I’ve already covered other “geeky” stuff like Smalltalk and CouchDB or Using Google V8 with MongoDB
- Terrastore Clojure client API: terrastore-cloj
A Clojure client for the Terrastore document store, coming in two flavors: chainable and bookmarkable.
Salvatore Sanfilippo:
The default and usually the preferred way for a client to chat with a Redis server is using the TCP protocol described in the Protocol Specification. In some environments the trade off of switching to a less reliable and not feature complete protocol running over UDP in order to improve latency is a good idea, so starting from Redis 2.2 there is support for a binary UDP protocol.
Based on a recent tweet it looks like the new Redis UDP protocol is already delivering some good results:
There are advantages (200k req/sec) but not so big as reported
Update: there are more comments about the performance of the Redis UDP protocol:
weird UDP results… in my over-ethernet tests, sometimes much faster, sometimes slower. I need to understand better what’s going on…
[…]
From the data I’ve currently, UDP is a big win when there are many clients interested in the same redis instance. For instance in clusters. So you got 100 redis servers and 500 FEPs. For this tasts UDP is the way to go. And indeed, Facebook is using memcached like that
2010-06-13 Sun
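According to a friend close to Google and Facebook, these two sites, which rank among the very top in the world, have lately been struggling with internal network congestion as well. Servers keep getting more powerful, data throughput keeps growing, and data exchange between internal application components and services is becoming ever more tangled. As a result, the data transfer units often hit their bottleneck before the data processing units reach their peak, and that triggers an avalanche.
But judging by today’s situation, the fail whale still surfaces from time to time, so the problem evidently has not been fundamentally solved. Taming the unpredictable traffic monster will take some more thought. For example, when capacity usage on any link of the request-response path exceeds 90%, the system should automatically enter an alert state and, following a predefined priority list, successively postpone or shut down non-critical background tasks and internal applications, block crawlers, and suspend high-traffic third-party applications, freeing capacity to serve real customer requests. Only once utilization falls back below 70% should the alert be lifted and the postponed, shut down, blocked and suspended access be automatically restored step by step.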
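The policy described above is essentially load shedding with hysteresis. Below is a minimal sketch of that idea, using the 90% and 70% thresholds from the text; the class name, the one-step-per-sample behaviour and the workload labels are assumptions made for illustration, not a description of any real system.

```python
class LoadShedder:
    """Shed workloads above the `high` threshold, restore them below `low`."""

    def __init__(self, shed_order, high=0.90, low=0.70):
        self.shed_order = list(shed_order)  # least important workload first
        self.high = high
        self.low = low
        self.shed = []  # workloads currently switched off, in shedding order

    def observe(self, utilization):
        """Feed one utilization sample (0.0-1.0); return the workloads now shed."""
        if utilization > self.high and len(self.shed) < len(self.shed_order):
            # Over the alert threshold: switch off the next least important workload.
            self.shed.append(self.shed_order[len(self.shed)])
        elif utilization < self.low and self.shed:
            # Under the recovery threshold: restore the most recently shed workload.
            self.shed.pop()
        # Between the two thresholds nothing changes, which avoids flapping.
        return list(self.shed)


shedder = LoadShedder(
    ["background jobs", "internal apps", "crawlers", "third-party apps"]
)
for sample in (0.85, 0.93, 0.95, 0.80, 0.65, 0.60):
    print(sample, shedder.observe(sample))
```

In practice the sample fed to `observe` would be the highest utilization seen on any link of the request-response path, for example `max(link_utilizations)`. The gap between the two thresholds is what keeps the system from switching workloads on and off around a single limit.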
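Most websites have never seriously measured, analyzed or organized their traffic composition, utilization, or request-response paths, so naturally they have nothing like a shutdown priority list, let alone automatic adjustment. When everything is important, nothing is important; when everything must be protected, nothing can be protected. When a sudden event hits, everything goes down together.
I trust the smart people on the Twitter team will find an effective solution. After all, for anyone making a living on the Internet, availability is the rice that goes into the pot.
Building a database performance model is something I have been thinking about lately. The topic is a very meaningful one, because in many situations we need to do performance evaluation, capacity planning and risk prediction for a database. Much of a DBA’s tuning experience is confined to a narrow area of database technology, without much understanding of the performance capacity of the system as a whole. I hope to offer some simple models and empirical data to help everyone understand overall system performance at a deeper level.
This slide deck may not reach the theoretical level of a model, and much of its data is not very accurate either; it is simply the result of my own thinking. I hope it serves as a starting point so that we can all think and improve together.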
2010-06-12 Sat
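Is your Alipay account still running unprotected? Or are you still frustrated by installing a digital certificate?
On June 2 the digital certificate was fully upgraded:
1. No backup and no import needed: a mobile phone is all it takes to complete the installation (by receiving an SMS verification code)
2. The account is protected as soon as the certificate is installed, sparing you the trouble of the certificate becoming unusable after reinstalling your computer, or of being unable to operate the account from another machine
3. A new installation method using email plus security questions: one more option, one more layer of account protection
Installation steps
- If you run into any problems while trying it out, tell us everything you liked and didn’t like (feedback mailbox: aq_szzs@alipay.com)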