
2017-08-25 Fri

02:52 Supervised Machine Learning: A Conversational Guide for Executives and Practitioners (21725 Bytes) » Official Pythian Blog

This post gives a systematic overview of the vital points to consider when building supervised learning models. We address, in Q&A style, some of the key decisions and issues to work through when building a machine learning/deep learning model. Whether you are an executive or a machine learning engineer, you can use this article to start comfortable conversations with each other and facilitate stronger communication about machine learning.

The field of machine learning is rapidly evolving, and with so many materials out there it can sometimes feel overwhelming and confusing. At other times, a plethora of resources can result in what is known as “the resource curse,” where you have so much but are benefitting so little. In situations like this, there is a need for guidance, for someone to lead by putting the pieces together and interpreting the entire landscape so that it is evident how all the pieces connect.

This post provides a high-level rundown of some of the vital recipes to follow when building a machine learning model. Aspiring machine learning practitioners express some of their challenges in questions like “how do I begin building machine learning models?”, “what are the key points to consider when working with machine learning?”. This article addresses these issues and also helps non-specialists in leadership roles understand how machine learning engineers approach these questions.

Where can we find sample datasets?

The first step to doing machine learning is to have data. As shown in a previous write-up on machine learning, there is no learning without data.

Thankfully, there are a ton of resources on the internet where we can easily access free and open datasets to use for machine learning practice. A few notable portals are:
UCI Machine Learning repository
KDNuggets
Kaggle
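
As a quick illustration, here is a minimal sketch of pulling one of these datasets into memory with Python and pandas (it assumes the classic Iris CSV is still hosted at the UCI URL below):

    import pandas as pd

    # The UCI repository hosts the classic Iris dataset as a headerless CSV.
    url = ("https://archive.ics.uci.edu/ml/"
           "machine-learning-databases/iris/iris.data")
    columns = ["sepal_length", "sepal_width",
               "petal_length", "petal_width", "species"]

    df = pd.read_csv(url, header=None, names=columns)
    print(df.head())   # peek at the first few rows
    print(df.shape)    # 150 rows, 5 columns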

Choosing a programming language

The choice of a suitable programming language for machine learning development is a tricky one and, for the most part, depends on the practitioner’s level of comfort with a particular type of language. However, certain languages have more advanced and mature learning packages and software libraries that make them better suited as preferred tools for machine learning/deep learning practice.

The top two preferred languages for deep learning/machine learning development from a practitioner’s perspective are:
– Python, and
– R

Following up on the previous point, a crucial factor in choosing a programming language comes down to the maturity of its machine learning/deep learning software packages, and that is where Python and R stand head and shoulders above other languages.

A good programming language is one in which a good number of reusable machine learning/deep learning functions exist. These methods have been written and debugged/tested by the best experts in the field, as well as a large supporting community of developers that contribute their time and expertise to maintain and improve the code. Having these functions abstracts the practitioner from the internals of a particular algorithm, freeing them to think and work at a higher level.

What do we do with Big Data?

The challenge of working with Big Data is its sheer size. Traditionally, when working with moderately sized datasets like the ones found on the UCI machine learning repository, all the data is loaded into memory before being processed. Loading the entire dataset into memory is simply impossible when working with gigabyte- or tera/petabyte-sized datasets.

The available solutions to training Big Data are:
1) To build a Spark/ Hadoop cluster of machines
2) To buy a dedicated GPU hardware, or
3) To leverage cloud services like Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure
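
Short of those options, one lightweight stopgap worth knowing is out-of-core processing: streaming a file in chunks rather than loading it whole. A minimal sketch with pandas (the file name and the "amount" column are placeholders):

    import pandas as pd

    # Stream a large CSV in 100,000-row chunks instead of loading it at once,
    # accumulating a running mean of one (placeholder) numeric column.
    total, count = 0.0, 0
    for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
        total += chunk["amount"].sum()
        count += len(chunk)

    print("mean:", total / count)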

Note: Future tutorials will make use of cloud resources for building production ready learning models. Please read the articles on Machine Learning on the Cloud and A Gentle Introduction to Google Cloud Platform for Machine Learning Practice to understand why leveraging the cloud is a preferred option for learning on big data.

Exploring the Dataset

Before passing data into any machine learning/deep learning algorithm for processing, a critical step is to study the dataset. By this, we mean:
– to visualize the data (i.e., make exploratory plots), and
– to examine the features of the dataset and their probability distributions

Let’s look more closely at some of these exploratory actions, and why they are necessary.

Visualize the data

We can perform two broad types of visualization on the dataset, and they are single and multi-feature visualization.

Single-feature visualization provides visual insight into a particular feature or attribute in the dataset. Examples of single-feature visualization are histograms and boxplots. Multi-feature visualization, as the name implies, can be used to visually inspect two or more features in the dataset. Examples are scatter plots and pairwise boxplots.

Visualizations are useful for spotting anomalies in the dataset that can otherwise affect the performance of the model. Some of the anomalies to watch out for are outliers and skewed data.
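
Continuing with the Iris frame loaded earlier, a minimal sketch of both kinds of visualization with pandas and matplotlib might look like this:

    import matplotlib.pyplot as plt

    # Single-feature view: a histogram reveals skew in one column.
    df["sepal_length"].plot(kind="hist", bins=20, title="sepal_length")
    plt.show()

    # Multi-feature view: a scatter plot exposes the relationship between
    # two columns, and makes outliers stand out.
    df.plot(kind="scatter", x="sepal_length", y="petal_length")
    plt.show()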


Examine the features of the data set

Machine learning models can only be built using numeric features of the dataset; in other words, all features of the dataset must be numeric, or else we cannot use them for building models.

By examining the features, we can ascertain which features are numeric and which ones are not. We can transform non-numeric (i.e., “categorical” or factors with levels) features into numeric variables using a technique called one-hot encoding. For other learning problems such as natural language problems which deal exclusively with textual data, there exist advanced techniques to transform text data into numeric representations (but that is beyond the scope of this post).
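
A minimal one-hot encoding sketch with pandas (the toy "color" column is made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # get_dummies() replaces a categorical column with one binary
    # indicator column per level: color_blue, color_green, color_red.
    encoded = pd.get_dummies(df, columns=["color"])
    print(encoded)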

Also, there may be a need to make sure the features of the dataset are on a similar scale before training the model. Why is this so? Because feature values on vastly different scales can lead to a poorly performing model. An easy way to get features into the same range is a technique called mean normalization, which subtracts the feature’s average from each record and divides by either the range or the standard deviation of the feature.
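
A minimal sketch of mean normalization on a toy numeric frame (both the range and standard-deviation variants are shown):

    import pandas as pd

    df = pd.DataFrame({"age": [22, 35, 58, 41],
                       "income": [48_000, 61_000, 93_000, 72_000]})

    # Subtract each column's mean and divide by its range (max - min) ...
    mean_norm = (df - df.mean()) / (df.max() - df.min())

    # ... or divide by the standard deviation instead (standardization).
    standardized = (df - df.mean()) / df.std()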

Another reason for exploring the dataset is to apply some data transformations on the features or variables. For example, if a particular feature is skewed, we can use a log-transformation on that feature to normalize the variables (i.e. to make them have a bell-curve or Gaussian curve for the specialists reading this), or we can apply any other relevant power transform like the Box-Cox transformation. From empirical observation, it helps in overall prediction accuracy to have normalized features in the dataset.
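
A minimal sketch of both transformations, using NumPy and SciPy on made-up right-skewed values:

    import numpy as np
    from scipy import stats

    skewed = np.array([1, 2, 2, 3, 5, 8, 13, 80, 200])  # made-up skewed data

    # log1p is a log transform that tolerates zeros; it pulls in the long tail.
    logged = np.log1p(skewed)

    # Box-Cox searches for the power transform that best normalizes the data;
    # it requires strictly positive input.
    transformed, lam = stats.boxcox(skewed)
    print("chosen lambda:", lam)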

Dealing with outliers

In the case of outliers, it is useful to investigate further to understand their cause; sometimes an outlier is simply due to human error during data entry, an instrument malfunction, or a measurement error. In practice, it is useful to detect and remove outliers, as they can significantly hamper model performance.

Dealing with missing values

Some machine learning/deep learning algorithms may not produce expected results or cause errors, due to the presence of missing values in the data. Developing a strategy for dealing with missing values is critical before moving on to train the model.

There are various strategies for dealing with missing values (both options below are sketched in code after this list). A couple of them include:
– replacing missing values with the mean of the feature they belong to (also called imputation) or
– removing all missing values if it still leaves a sizeable data set.
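
A minimal sketch of both strategies with pandas, on a toy frame with holes:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [22, np.nan, 58, 41],
                       "income": [48_000, 61_000, np.nan, 72_000]})

    # Option 1: mean imputation -- fill each hole with its column's average.
    imputed = df.fillna(df.mean())

    # Option 2: drop every row containing a missing value
    # (acceptable when plenty of data remains).
    dropped = df.dropna()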

Split the dataset into training, validation & test sets

The goal of machine learning is to be able to predict or classify outcomes on unseen examples correctly. We can misjudge the performance of our learning models if we evaluate the model performance with the same samples used to train the model. To learn more about this, please view this post on Understanding Machine Learning: An Executive Overview.

To properly evaluate the performance of a learning algorithm, we need to set aside some data for testing purposes. This hold-out data is called a test set.

Another situation arises when we have trained the model on a dataset, and we need to improve the performance of the model by tuning some of the learning algorithm’s hyperparameters. Which dataset will we use to train the adjusted model?

We cannot use the test set for model tuning because, if we do, the model learns the test data set, rendering it unusable for model evaluation. Thus, we divide the data set into three compartments (a split sketch in code follows this list), namely:
– the training set (to train the model),
– the validation set (for model tuning), and
– the test set (for model evaluation)
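
A minimal sketch of a 60/20/20 split using scikit-learn’s train_test_split twice (the toy X and y below are synthetic):

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))   # toy feature matrix
    y = rng.normal(size=1000)        # toy target

    # First carve off 20% as the held-out test set ...
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # ... then split the remainder 75/25, giving 60/20/20 overall.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=42)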

What are predictor/estimator and response variables?

In a dataset, estimator or predictor variables are the variables used to train the model, so it makes correct prediction/classification on the outcome variable. Estimators or predictors are also known as input variables while the outcome variable is known as the target variable. We can have two or more input variables but typically only one target variable in a learning problem.

We must decide from the learning problem which variables are inputs and which one is the target. The variable type of the outcome variable is important in choosing candidate machine learning/deep learning algorithms, as well as in determining the evaluation metric. If the outcome is categorical, then the learning problem is a classification problem, whereas if the outcome variable is real-valued, then the learning problem is a regression task.

Choosing candidate algorithms

There are a ton of machine learning/deep learning algorithms and methods out there. Hopefully, the points discussed below can guide the practitioner in choosing the algorithm most suitable for the learning problem at hand.

Just before we delve into explaining how to select candidate algorithms, please note that as a practitioner, it is not altogether necessary to be acutely aware of the salient mathematical details that govern a particular learning algorithm. However, one should have the following questions in mind when thinking about a particular algorithm.
1) How does this algorithm work (at least from a high-level perspective),
2) In what cases does this algorithm perform best,
3) What are the limitations of the algorithm, and
4) How do we interpret the results of the learning algorithm

The following broad guidelines can help in selecting the right type of algorithm that performs well on the dataset.
1) The underlying structure of the data.
2) The target variable type (i.e., is the target variable categorical or real-valued)

Let’s look briefly at each listed guideline.

1) The underlying structure of the data

Two broad classes of supervised machine learning/ deep learning algorithms exist, one is linear/parametric, while the other is nonlinear or nonparametric. Parametric methods assume that the relationship between the variables of the data is linear whereas nonlinear methods do not make such an assumption.

If the relationship between the variables in the dataset is linear, then we are better off using a linear machine learning algorithm. A simple technique to quickly check if linearity exists is to visualize the relationships in the dataset by plotting a pair-wise scatter diagram.

Surprisingly, linear methods sometimes do well even when the relationship between variables in the data set is nonlinear. This is particularly the case when we have lots of data, and a learning principle known as Occam’s razor helps explain it. Loosely stated, Occam’s razor says that a simpler model is often preferred when choosing among competing models. We discuss this phenomenon further in the section on the bias vs. variance tradeoff.

In summary, as a rule of thumb when selecting candidate algorithms, try to include at least one linear method (e.g., linear regression) just to gauge its performance. It can also serve as a benchmark against which to compare more sophisticated nonlinear candidate algorithms.
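
A minimal sketch of that rule of thumb with scikit-learn, scoring a linear baseline against a nonlinear challenger on a synthetic dataset:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=500, n_features=10,
                           noise=10.0, random_state=0)

    # Cross-validated R^2 for the simple linear baseline ...
    baseline = cross_val_score(LinearRegression(), X, y, cv=5).mean()

    # ... and for a nonlinear challenger; prefer the challenger only
    # if it clearly beats the baseline.
    challenger = cross_val_score(
        RandomForestRegressor(random_state=0), X, y, cv=5).mean()
    print(f"linear: {baseline:.3f}  random forest: {challenger:.3f}")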

2) The outcome variable type (categorical or real-valued)

Some machine learning algorithms are better suited for predicting real-valued outputs, e.g., linear regression, while others are better suited for classifying outcomes of the response variable, e.g., logistic regression. So the learning task is a crucial decider in narrowing the classes of algorithms to consider when selecting from a pool of learning methods to use for training the model.

Bias vs. Variance trade-off

The concept of bias vs. variance is very central to machine learning and is key to understanding how the model is performing, as well as in deciding how best to improve the model.

Succinctly defined, bias occurs when the model oversimplifies the learning problem or fails to accurately capture the complex relationships inherent in the dataset, making the model unable to generalize to new examples. Variance, on the other hand, occurs when the model learns the intricate patterns of the dataset too closely and in the process learns the irreducible noise inherent in every dataset. When the learning algorithm learns the patterns, including the noise, of the training data entirely, it fails to generalize when exposed to unseen data.

Hence, we observe a constant tension between bias and variance, and it is often down to the skill of the model builder to find the middle ground. This balance is unique to every dataset. However, some practical rules of thumb can prove useful to the machine learning practitioner.

Common empirical causes of bias

– using an overly simplistic algorithm for a nonlinear data set
– not enough data samples
– assigning large values to the regularization parameter (more of this in a future post)
– using too few features that don’t capture the learning problem adequately

Common empirical causes of variance

– using too many (high-dimensional) features
– tuning the hyperparameters of a nonlinear algorithm until it becomes too flexible and learns the noise in the dataset
– not using regularization, or using a minimal regularization value.

How do we recognize the presence of bias or variance in the results?

Bias is evident when the model performs poorly on the training data. Of course, it also performs poorly (or even worse) on test data.

Variance is detected when the trained model learns the training data very well but performs poorly on unseen (test) data.

How do we know that the model is performing well? Choosing an evaluation metric

Evaluation metrics give us a way to know how well our models are doing. We evaluate on the training data to get the training set accuracy, and on the test data to measure accuracy on unseen examples. Evaluating on test data helps us know whether our models are high performing, or whether they exhibit bias or variance.

The learning problem type determines the type of evaluation metric to use. Typically, for real-valued prediction problems, we use the mean-squared error (MSE) for evaluation. For “categorical” classification problems, on the other hand, it is best to plot a confusion matrix to get a clear picture of how many samples are correctly classified or misclassified. From the confusion matrix, one can derive other useful metrics for evaluating classification problems, such as accuracy, precision, and recall.
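
A minimal sketch of both kinds of metric with scikit-learn (the tiny label arrays are made up for illustration):

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 mean_squared_error, precision_score,
                                 recall_score)

    # Regression: mean-squared error between true and predicted values.
    print(mean_squared_error([3.0, 5.0, 2.5], [2.8, 5.4, 2.9]))

    # Classification: the confusion matrix, plus metrics derived from it.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    print(confusion_matrix(y_true, y_pred))
    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))
    print(recall_score(y_true, y_pred))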

Improving the model

To improve the performance of the model, we can consider a variety of techniques using the validation dataset. A couple of options to consider are:
1) Systematic feature engineering
2) Using ensemble learning methods
3) Hyperparameter tuning of the algorithm

1) Feature Engineering

The practitioner usually spends a significant portion of time on feature engineering. Feature engineering is the practice of systematically researching each feature in the dataset and examining its relevance to the outcome variable.

Through feature engineering, we can cleverly introduce new features by combining one or more existing features based on insights gained, and this can have a massive impact on prediction accuracy. Often, the difference between a good model and a competition-winning model is the quality of the feature engineering performed.


2) Ensemble methods

Ensemble methods, generally speaking, combine the outputs of various models to produce a stronger, better performing model. Two major classes of ensemble learning algorithms are:
– Boosting, and
– Bagging

Empirically speaking, ensemble methods such as Random Forests and gradient boosting are known to do very well on various machine learning problems and are the algorithms of choice for many competitors in machine learning competitions.

It is wise to try ensemble methods on the learning problem to see how well they perform in comparison with other candidate algorithms.
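
A minimal sketch comparing one bagging-style and one boosting-style ensemble with scikit-learn on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Bagging-style: many decorrelated trees, predictions averaged.
    forest = cross_val_score(
        RandomForestClassifier(random_state=0), X, y, cv=5).mean()

    # Boosting-style: trees added sequentially to correct earlier errors.
    boosted = cross_val_score(
        GradientBoostingClassifier(random_state=0), X, y, cv=5).mean()
    print(f"forest: {forest:.3f}  boosting: {boosted:.3f}")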

3) Hyper-parameter tuning

When applying a particular learning algorithm, we can adjust certain parameters of the algorithm. These parameters are called hyper-parameters principally because they are properties of the learning algorithm and not the model itself.

It is good practice to set up a grid search and adjust the hyperparameters to determine the configuration that results in improved prediction/classification accuracy.
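
A minimal grid-search sketch with scikit-learn (the grid values are illustrative, not recommendations):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=500, random_state=0)

    # Evaluate every combination in the grid with 5-fold cross-validation.
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 100, 200],
                    "max_depth": [None, 5, 10]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)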

Okay! That is all for now, folks!
Please be sure to leave your comments/ contribution below.

Remember, this post serves as a prelude to practical machine learning/ deep learning tutorials that build on the issues discussed here.

Thanks for reading and thanks to Paul Spiegelhalter for the technical review of the content.

Note: This was originally published in ekababisong.org

01:23 Log Buffer #519: A Carnival of the Vanities for DBAs (2771 Bytes) » Official Pythian Blog

This Log Buffer Edition covers Oracle, SQL Server and MySQL.

Oracle:

Fix index corruption found using analyze table validate structure

Customizing a CoreOS image/iso: Ignition

12.2 New Feature: the FLEX ASM disk group part 4

An Important Change Is Coming for Application Express 5.2 (if you use the Mobile User Interface)

Creating an Oracle Database Docker image

SQL Server:

Installing SQL Server Agent on SUSE Linux

Load Extended Events via Powershell

The SQL Server Features that Time Forgot

Using the Copy Wizard for the Azure Data Factory

Azure Resource Manager (ARM) Templates

MySQL:

Migrating Data from an Encrypted Amazon MySQL RDS Instance to an Encrypted Amazon Aurora Instance

Automatically Dropping Old Partitions in MySQL and MariaDB

Connecting PHP to MySQL on Bluemix

No more mysql.proc in MySQL 8.0

How ProxySQL deal with schema (and schemaname)

2017-08-24 Thu

22:55 Join Top Data Experts at Pythian’s Love Your Data Conference in London (1871 Bytes) » Official Pythian Blog

According to Gartner’s CIO 2017 survey, BI and Analytics continue to top IT investments.

Are you an IT or BI business leader looking for ways to drive business outcomes with data and analytics?  If so, join Pythian on September 13, 2017, at the Love Your Data Conference on-site at Google’s London office located at 123 Buckingham Palace Road.

During this one-day conference, you will find out how to create a data-driven organisation, hear from industry leading data experts and more.

The presentations will address the challenges and benefits associated with data and provide insights into powerful ways that businesses today can use their data for greater success.

You will gain valuable knowledge and hear about use cases from leading data experts at organisations including Google, Telefonica and the Direct Marketing Association who will provide practical information on the following:


  • How to transform your organisation using data and self-service analytics
  • Insights on ways to align IT with the business by giving all/relevant users access to data
  • Adding data intelligence and automation to business decisions
  • Get a 360-degree view of customers and promote innovation


Get out of the office for a day and be inspired about where data and analytics can take your business.

REGISTER TODAY

Space is limited; please register early to guarantee your seat.

22:19 Pythian Will Be at Percona Live Europe 2017! Here’s What You Need to Know… (2474 Bytes) » Official Pythian Blog

There are so many things you can do in beautiful Dublin, Ireland. Tour the historic St. Patrick’s Cathedral, kiss the Blarney Stone at Blarney Castle to summon up a bit of good luck…

But if you LOVE open source as much as we do, register to attend the premier open source database conference, Percona Live Europe 2017, which will be taking place at the Radisson Blu Royal Hotel from September 25-27.

This tech focused conference will consist of tutorials, informative sessions and a fantastic opportunity to network amongst tech aficionados and top industry leaders. It also promises to provide an “emphasis on the core topics of MySQL, MongoDB, PostgreSQL and other open source databases”.

What has us really excited (like everything else wasn’t enough)? Providing you with the scoop on sessions set to be presented by a few members of our elite crew:

Monday, September 25, 2017:

Ronald Bradford (with co-presenter Colin Charles from Percona): Implementing MySQL Security Features | 1:30PM – 4:30PM

Tuesday, September 26, 2017:
Wednesday, September 27, 2017:

We know.  We’re pretty stoked too! Advanced registration tickets are available until September 5 so click here to get yours now.  See you there!

2017-08-23 Wed

01:24 How to Analyze SQL Server Stack Dumps (18386 Bytes) » Official Pythian Blog

As a SQL Server production DBA, at some point in time you will have to deal with a stack dump. A stack dump is a file that is written to disk with an “.mdmp” extension when SQL Server encounters a running condition that it does not have a built-in error to handle. It contains the call stacks and memory information of the threads running when this situation happened and it’s done so that it can be used for troubleshooting after the fact.

The reasons for this type of error condition vary quite a bit, ranging from CPU scheduling problems to issues with RAM chips, very heavy IO latency, linked server drivers, etc. In the worst case, stack dumps can cause serious system instability and can trigger cluster failovers or SQL Server service restarts. In this blog post, we’re going to focus on analyzing the somewhat common condition of the Non-Yielding scheduler.

Because Microsoft owns the SQL Server source code, they are best positioned to be able to troubleshoot these unhandled conditions. However, this doesn’t mean that we can’t do some analysis ourselves and in many cases get a solid idea of what is triggering the issue. As such, knowing how to do this analysis can be critical to getting quick workarounds in place while you are working with Microsoft support to find a root cause and permanent fix.

Now, let’s move to the actual practical piece and get started!

  • The first thing you’re going to want to do is to have a stack dump to analyze. If you have already experienced this on your system, then you can go find the .mdmp files in the Log directory where the SQL ERRORLOG is found.
  • Head over to this link and download and install the debugging tools for Windows. The one we are interested in is called WinDbg. There are several methods to get the tools on your system; I personally downloaded the Windows SDK ISO and then during installation only selected “Debugging tools for Windows”.
  • Open WinDbg (in my case it was located in C:\Program Files\Windows Kits\10\Debuggers\x64), go to File – Open Crash Dump.
  • The tool has now recognized that this file is a stack dump and there is information that could be valuable.

    We can verify that the dump was generated by a ‘Non-yielding Scheduler’ condition and WinDbg has found “an exception of interest stored in it”.

  • At this point, we can’t see much detail because we don’t have the public symbols for SQL Server on the analysis machine. Public symbols are files provided by the software developer (in this case Microsoft) that give high-level information on the executable to be able to do this type of debugging. For example, in the case of SQL Server, we want to be able to see the method names inside the C++ code. There are also private symbols which provide even more information like local variable names or source line info, but these are only accessible to the developer of the application, in this case Microsoft support.
  • To get the symbols we will download off of Microsoft’s server, go to the command line on WinDbg and type the following:

    .sympath srv*c:\debugsymbols*http://msdl.microsoft.com/download/symbols;

    This command will set the Symbols path for the tool on c:\debugsymbols and will point it to download them from Microsoft’s public symbol URL.

    To load the symbols now type:

    .reload /f

    The tool will start downloading the files and might throw some errors if it can’t find symbols for some of the DLLs. That’s OK as long as it was able to find the symbols for SQL Server.

    At this point, it’s a good idea to save this configuration in a workspace so that we don’t have to enter it every time. To do this, just go to File – Save Workspace and give it a name. Next time, you can open WinDbg, open the Workspace and then the crash dump and you won’t need to reload the symbols.

    Back to the command screen, we can verify that the symbols for SQL Server are available by running:

    lmvm sqlservr


    We can see in the output that indeed there is a match found in our symbols for the module mentioned in the dump, in this case, the main SQL Server executable sqlservr.exe.

    It also provides the exact version information for SQL Server. For example, the above screenshot identifies version 11.0.6594, which is SQL Server 2012 Service Pack 3, cumulative update #4. Of course, you can always get the information from the server itself, but if all you have is the file, then this is an alternative way to get it. Knowing the exact version information is important since the dump could have been generated by a bug that is specific to a version.

  • OK, so now we’re all set up, but where do we start analyzing the dump content? Keep in mind that the dump has information from all the threads active when the dump happened. However, we’re only interested in the thread that actually caused the issue. The easiest way to get started is to let WinDbg analyze the dump, see if it finds an exception, and take you to that context.

    To do this, type the following command:

    !analyze -v

    This command will display where the exception is found and the call stack with it.

    In this particular case I got:

    ntdll!NtWriteFile+0xa
    KERNELBASE!WriteFile+0xfe
    kernel32!WriteFileImplementation+0x36
    sqlmin!DiskWriteAsync+0xd9
    sqlmin!FCB::AsyncWrite+0x19a
    sqlmin!RecoveryUnit::GatherWrite+0xc2
    sqlmin!BPool::LazyWriter+0x549
    sqlmin!lazywriter+0x204
    sqldk!SOS_Task::Param::Execute+0x21e
    sqldk!SOS_Scheduler::RunTask+0xa8
    sqldk!SOS_Scheduler::ProcessTasks+0x29a
    sqldk!SchedulerManager::WorkerEntryPoint+0x261
    sqldk!SystemThread::RunWorker+0x8f
    sqldk!SystemThreadDispatcher::ProcessWorker+0x372
    sqldk!SchedulerManager::ThreadEntryPoint+0x236
    kernel32!BaseThreadInitThunk+0xd
    ntdll!RtlUserThreadStart+0x1d

    There is a caveat around Non-yielding scheduler dumps. Sometimes, the stack that you get through analyze might not be the one that was not yielding. The reason is that, by the time the dump is generated, the task might have actually yielded. To cover this scenario, SQL Server copies the offending stack to a global address that we can reach in the dump. This is the command to move to that structure:

    .cxr sqlmin!g_copiedStackInfo+0X20

    The .cxr command sets the debugger context to a particular address, which in this case we know beforehand. You might be thinking: where did that special copiedStackInfo context come from? The truth is that this is knowledge floating around the Internet, based on information published by Microsoft’s own support team. For example, you can see one blog post here.

    Once I’m in that context, I get the call stack with this command:

    kc 100

    The k command displays the call stack of the current context, the letter c means we want the output to be ‘clean’ and only show module and function names, and the 100 means we want to see up to 100 calls down the stack. For this particular dump both stacks were the same; in other scenarios, however, they could have been different.

  • With the call stack available, let’s see how to read it. First, remember it’s a stack, so the method called last is at the top. To follow the logical sequence of events, we need to read from the bottom up.

    In this particular call stack, we can see the following sequence of events:

    • Thread got a task assigned
    • The task was to do a lazy writer memory sweep
    • Lazy writer uses a Gather Write method to group many pages into one write IO
    • Then initiates an asynchronous IO request
    • Then the thread moves into kernel mode to do the file write operation

    The scheduler is not yielding while doing a lazy writer sweep to clear dirty pages and writing them to file. We’re looking then at a disk subsystem that is either too slow or becomes overloaded when this happens. With this information, we can start looking at these different factors to try to alleviate the situation:

    • Analyze the buffer cache and see which database is consuming most memory
    • Look into processes changing a lot of data at once and break them up
    • Look into storage latency and see if we’re not getting what we should in terms of latency and throughput

Let’s look at another case, this time it was a ‘Latch timeout’ situation. Let’s see the steps:

  • I open the workspace I created in the previous example and then open the dump. Here’s the output:

    We can see at the top the dump type as ‘Latch timeout’ and again we can see there is ‘an exception of interest stored in it’.

  • We can verify we got the symbols again and here’s what we get:

    We can tell that based on version, this is SQL Server 2005 SP4.

  • We run !analyze -v and this is the call stack generated:

    kernel32!RaiseException
    sqlservr!CDmpDump::Dump
    sqlservr!CImageHelper::DoMiniDump
    sqlservr!stackTrace
    sqlservr!LatchBase::DumpOnTimeoutIfNeeded
    sqlservr!LatchBase::PrintWarning
    sqlservr!_chkstk
    sqlservr!LatchBase::AcquireInternal
    sqlservr!KeyRangeGenerator::GetNextRange
    sqlservr!IndexDataSetSession::GetNextRangeForChildScan
    sqlservr!IndexDataSetSession::SetupNextChildSubScan
    sqlservr!IndexDataSetSession::GetNextRowValuesInternal
    sqlservr!RowsetNewSS::FetchNextRow
    sqlservr!CQScanTableScanNew::GetRow
    sqlservr!CQScanNLJoinTrivialNew::GetRow
    sqlservr!CQScanNLJoinNew::GetRowHelper
    sqlservr!CQScanNLJoinNew::GetRowHelper
    sqlservr!CQScanSortNew::BuildSortTable
    sqlservr!CQScanSortNew::OpenHelper
    sqlservr!CQScanNLJoinNew::Open
    sqlservr!CQScanNew::OpenHelper
    sqlservr!CQScanXProducerNew::Open
    sqlservr!FnProducerOpen
    sqlservr!FnProducerThread
    sqlservr!SubprocEntrypoint
    sqlservr!SOS_Task::Param::Execute
    sqlservr!SOS_Scheduler::RunTask
    sqlservr!SOS_Scheduler::ProcessTasks
    sqlservr!SchedulerManager::WorkerEntryPoint
    sqlservr!SystemThread::RunWorker
    sqlservr!SystemThreadDispatcher::ProcessWorker
    sqlservr!SchedulerManager::ThreadEntryPoint

  • The call stack is giving this sequence of steps:

    – Thread got a task assigned

    – A parallel operation is requesting a table scan

    – The thread is requesting a range of keys to work on

    – A latch is requested to get the range and it times out

    Going back to the SQL Server Errorlog around the time of the dump, we have messages like this one:

    Timeout occurred while waiting for latch: class ‘ACCESS_METHODS_SCAN_RANGE_GENERATOR’, id 00000002042630D0, type 4, Task 0x00000000055F8B08 : 7, waittime 300, flags 0x1a, owning task 0x00000000049B7C18. Continuing to wait.

    We know that the ACCESS_METHODS_SCAN_RANGE_GENERATOR is a latch type used to distribute work during parallel scans, so both the log and the dump are confirming the issue.

  • An alternative way to navigate the dump is to search the dump for a particular string. For this we can use the X command:

    X *!*DoMiniDump*

    I’m using the * character as a wildcard to tell WinDbg to go look inside any module (*!) and find a function that contains this string (*DoMiniDump*). Sure enough, I get this result:

    00000000`01f6c0d0 sqlservr!CImageHelper::DoMiniDump

    Now that we know the exact function name we want, we can search for it with this command:

    !findstack sqlservr!CImageHelper::DoMiniDump

    And we get this result:

    Thread 054, 1 frame(s) match

    * 15 000000000fc5aba0 0000000001f6f797 sqlservr!CImageHelper::DoMiniDump+0x413

    From that result, we know that the thread that produced the dump was #54. So we move to that thread with this command:

    ~54s

    The ~ symbol is a command to do a thread related action, the 54 is the thread number we want and the letter s is telling the debugger to make this the current thread.

    Finally, just as in the previous example, we can run kc 100 and get the call stack shown in step #3 above.

  • Given what we have found, this case looks like a parallelism-related bug; at the very least, we know it is triggered by a parallel scan operation. While we wait for a fix from support, we can apply workarounds to avoid the parallel scan through indexing, query tuning, or running the query with no parallelism for now.

So we’ve seen an example for ‘Non-yielding Scheduler’ and ‘Latch timeout’, but you can apply a similar process to analyze other dump types like ‘Deadlock scheduler’, ‘Unhandled exception’ and ‘Access violation’. Again, the goal is to be able to figure out quick workarounds that help you honor the SLA of your database and its applications even while dealing with product errors. In parallel, work with support to find a true resolution, or patch the server if it’s something that already has a fix.