algorithm
A series of repeatable steps for carrying out a certain type of task with data. As with data structures, people studying computer science learn about different algorithms and their suitability for various tasks. Specific data structures often play a role in how certain algorithms get implemented. See also data structure.

AngularJS
An open-source JavaScript library maintained by Google and the AngularJS community that lets developers create what are known as single-page web applications. AngularJS is popular with data scientists as a way to show the results of their analysis. See also JavaScript, D3.

artificial intelligence
Also, AI. The ability to have machines act with apparent intelligence, although varying definitions of intelligence lead to a range of meanings for the artificial variety. In AI's early days in the 1960s, researchers sought general principles of intelligence to implement, often using symbolic logic to automate reasoning. As the cost of computing resources dropped, the focus moved more toward statistical analysis of large amounts of data to drive decision making that gives the appearance of intelligence. See also machine learning, data mining.

backpropagation
Also, backprop. An algorithm for iteratively adjusting the weights used in a neural network system. Backpropagation is often used to implement gradient descent. See also neural network, gradient descent.

Bayes' Theorem
Also, Bayes' Rule. An equation for calculating the probability that something is true if something potentially related to it is true. If P(A) means the probability that A is true and P(A|B) means the probability that A is true if B is true, then Bayes' Theorem tells us that P(A|B) = (P(B|A) × P(A)) / P(B). This is useful for working with false positives: for example, if x% of people have a disease, the test for it is correct y% of the time, and you test positive, Bayes' Theorem helps calculate the odds that you actually have the disease.
The theorem also makes it easier to update a probability based on new data, which makes it valuable in the many applications where data continues to accumulate. Named for eighteenth-century English statistician and Presbyterian minister Thomas Bayes. See also Bayesian network, prior distribution.

Bayesian network
Also, Bayes net. Bayesian networks are graphs that compactly represent the relationship between random variables for a given problem. These graphs aid in performing reasoning or decision making in the face of uncertainty. Such reasoning relies heavily on Bayes' rule. [bourg] These networks are usually represented as graphs in which the link between any two nodes is assigned a value representing the probabilistic relationship between those nodes. See also Bayes' Theorem, Markov Chain.

bias
In machine learning, bias is a learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. It's easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch). [domingos] See also variance, overfitting, classification.

big data
As this has become a popular marketing buzz phrase, definitions have proliferated, but in general, it refers to the ability to work with collections of data that had been impractical before because of their volume, velocity, and variety (the "three Vs"). A key driver of this new ability has been easier distribution of storage and processing across networks of inexpensive commodity hardware using technology such as Hadoop, instead of requiring larger, more powerful individual computers. The work done with these large amounts of data often draws on data science skills.
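The false-positive calculation described in the Bayes' Theorem entry above can be sketched in a few lines of Python. The 1% disease rate and 99% test accuracy below are made-up illustration numbers, not figures from the text:

```python
def p_a_given_b(p_a, p_b_given_a, p_b_given_not_a):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B),
    expanding P(B) with the law of total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# A = "you have the disease", B = "you tested positive".
# Assume 1% of people have the disease and the test is right 99% of the time.
result = p_a_given_b(p_a=0.01, p_b_given_a=0.99, p_b_given_not_a=0.01)
print(result)  # 0.5: even after a positive test, it's only a coin flip
```

The counterintuitive result is exactly why the entry highlights false positives: when a condition is rare, even an accurate test produces many more false alarms than true detections.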
binomial distribution
A distribution of outcomes of independent events with two mutually exclusive possible outcomes, a fixed number of trials, and a constant probability of success. This is a discrete probability distribution, as opposed to continuous: for example, instead of graphing it with a line, you would use a histogram, because the potential outcomes are a discrete set of values. As the number of trials represented by a binomial distribution goes up, if the probability of success remains constant, the histogram bars will get thinner, and it will look more and more like a graph of the normal distribution. See also probability distribution, discrete variable, histogram, normal distribution.

chi-square
Chi (pronounced like "pie" but beginning with a "k") is a Greek letter, and chi-square is a statistical method used to test whether the classification of data can be ascribed to chance or to some underlying law. [websters] The chi-square test is an analysis technique used to estimate whether two variables in a cross tabulation are correlated. [shin] A chi-square distribution varies from the normal distribution based on the degrees of freedom used to calculate it. See also normal distribution, and Wikipedia on the chi-squared test and on the chi-squared distribution.

classification
The identification of which of two or more categories an item falls under, a classic machine learning task. Deciding whether an email message is spam or not classifies it among two categories, and analysis of data about movies might lead to classification of them among several genres. See also supervised learning.

clustering
Any unsupervised algorithm for dividing up data instances into groups: not a predetermined set of groups, which would make this classification, but groups identified by the execution of the algorithm because of similarities that it found among the instances. The center of each cluster is known by the excellent name "centroid." See also classification, supervised learning, unsupervised learning, k-means clustering.
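A toy version of such a clustering algorithm (the k-means approach covered later in this glossary) can be sketched in Python. The one-dimensional data points and helper names below are invented for illustration:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """A minimal 1-D k-means sketch: assign each point to the nearest
    centroid, then move each centroid to the mean of its points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # A centroid with no points keeps its old position.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]  # two obvious groups
print(kmeans(data, k=2))  # centroids near 1.0 and 10.0
```

Because the algorithm is told only how many groups to look for, not what they are, this is unsupervised learning in the sense the entry describes.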
coefficient
A number or algebraic symbol prefixed as a multiplier to a variable or unknown quantity (for example, x in x(y + z), or 6 in 6ab). [websters] When graphing an equation such as y = 3x + 4, the coefficient of x determines the line's slope. Discussions of statistics often mention specific coefficients for specific tasks, such as the correlation coefficient, Cramer's coefficient, and the Gini coefficient. See also correlation.

computational linguistics
Also, natural language processing, NLP. A branch of computer science for parsing text of spoken languages (for example, English or Mandarin) to convert it to structured data that you can use to drive program logic. Early efforts focused on translating one language to another or accepting complete sentences as queries to databases; modern efforts often analyze documents and other data (for example, tweets) to extract potentially valuable information. See also GATE, UIMA.

confidence interval
A range specified around an estimate to indicate margin of error, combined with a probability that a value will fall in that range. The field of statistics offers specific mathematical formulas to calculate confidence intervals.

continuous variable
A variable whose value can be any of an infinite number of values, typically within a particular range. For example, if you can express age or size with a decimal number, then they are continuous variables. In a graph, the value of a continuous variable is usually expressed as a line plotted by a function. Compare discrete variable.

correlation
The degree of relative correspondence, as between two sets of data. [websters] If sales go up when the advertising budget goes up, they correlate. The correlation coefficient is a measure of how closely the two data sets correlate. A correlation coefficient of 1 is a perfect correlation, .9 is a strong correlation, and .2 is a weak correlation. This value can also be negative, as when the incidence of a disease goes down when vaccinations go up. A correlation coefficient of -1 is a perfect negative correlation.
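The correlation coefficient discussed above can be computed directly. This is the Pearson formulation, with made-up advertising and sales figures:

```python
def mean(xs):
    return sum(xs) / len(xs)

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ad_budget = [10, 20, 30, 40]     # made-up figures
sales     = [100, 195, 310, 400]
print(correlation(ad_budget, sales))  # very close to 1 for these numbers
```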
Always remember, though, that correlation does not imply causation. See also coefficient.

covariance
A measure of the relationship between two variables whose values are observed at the same time; specifically, the average value of the product of the two variables diminished by the product of their average values. [websters] Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means. [grus] See also variance, mean.

cross-validation
When using data with an algorithm, the name given to a set of techniques that divide up data into training sets and test sets. The training set is given to the algorithm, along with the correct answers, and becomes the set used to make predictions. The algorithm is then asked to make predictions for each item in the test set. The answers it gives are compared to the correct answers, and an overall score for how well the algorithm did is calculated. [segaran] See also machine learning.

D3
Data-Driven Documents. A JavaScript library that eases the creation of interactive visualizations embedded in web pages. D3 is popular with data scientists as a way to present the results of their analysis. See also AngularJS, JavaScript.

data engineer
A specialist in data wrangling. Data engineers are the ones that take the messy data and build the infrastructure for real, tangible analysis. They run ETL software, marry data sets, enrich and clean all that data that companies have been storing for years. [biewald] See also data wrangling. (A Wikipedia search for "data engineering" redirects to "information engineering," an older term that describes a more enterprise-oriented job with greater system architecture responsibility and less hands-on work with the data.)

data mining
Generally, the use of computers to analyze large data sets to look for patterns that let people make business decisions. While this sounds like much of what data science is about, popular use of the term is much older, dating back at least to the 1990s.
See also data science.

data science
The ability to extract knowledge and insights from large and complex data sets. [patil] Data science work often requires knowledge of both statistics and software engineering. See also data engineer, machine learning.

data structure
A particular arrangement of units of data such as an array or a tree. People studying computer science learn about different data structures and their suitability for various tasks. See also algorithm.

data wrangling
Also, data munging. The conversion of data, often through the use of scripting languages, to make it easier to work with. If you have 900,000 birthYear values of the format yyyy-mm-dd and 100,000 of the format mmddyyyy, and you write a Perl script to convert the latter to look like the former so that you can use them all together, you're doing data wrangling. Discussions of data science often bemoan the high percentage of time that practitioners must spend doing data wrangling; the discussions then recommend the hiring of data engineers to address this. See also Perl, Python, shell, data engineer.

decision trees
A decision tree uses a tree structure to represent a number of possible decision paths and an outcome for each path. If you have ever played the game Twenty Questions, then it turns out you are familiar with decision trees. [grus] See also random forest.

deep learning
Typically, a multi-level algorithm that gradually identifies things at higher levels of abstraction. For example, the first level may identify certain lines, then the next level identifies combinations of lines as shapes, and then the next level identifies combinations of shapes as specific objects. As you might guess from this example, deep learning is popular for image classification. See also neural network.

dependent variable
The value of a dependent variable depends on the value of the independent variable. If you're measuring the effect of different sizes of an advertising budget on total sales, then the advertising budget figure is the independent variable and total sales is the dependent variable.
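The date-format conversion in the data wrangling entry above can be sketched in Python instead of Perl; normalize_date is a hypothetical helper name for this illustration:

```python
def normalize_date(value):
    """Convert an 8-digit mmddyyyy string to yyyy-mm-dd; pass anything
    else through unchanged, so already-clean values are untouched."""
    if len(value) == 8 and value.isdigit():
        return f"{value[4:]}-{value[:2]}-{value[2:4]}"
    return value

print(normalize_date("07041999"))    # 1999-07-04
print(normalize_date("1999-07-04"))  # already normalized, left alone
```

Running a function like this over the 100,000 nonconforming values is exactly the kind of unglamorous but necessary conversion the entry describes.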
dimension reduction
Also, dimensionality reduction. We can use a technique called principal component analysis to extract one or more dimensions that capture as much of the variation in the data as possible. Dimensionality reduction is mostly useful when your data set has a large number of dimensions and you want to find a small subset that captures most of the variation. [grus] Linear algebra can be involved; broadly speaking, linear algebra is about translating something residing in an m-dimensional space into a corresponding shape in an n-dimensional space. [shin] See also linear algebra.

discrete variable
A variable whose potential values must be one of a specific number of values. If someone rates a movie with between one and five stars, with no partial stars allowed, the rating is a discrete variable. In a graph, the distribution of values for a discrete variable is usually expressed as a histogram. See also continuous variable, histogram.

econometrics
The use of mathematical and statistical methods in the field of economics to verify and develop economic theories. [websters]

feature
The machine learning expression for a piece of measurable information about something. If you store the age, annual income, and weight of a set of people, you're storing three features about them. In other areas of the IT world, people may use the terms property, attribute, or field instead of feature. See also feature engineering.

feature engineering
To obtain a good model, however, often requires more effort and iteration and a process called feature engineering. Features are the model's inputs. They can involve basic raw data that you have collected, such as order amount; simple derived variables, such as "Is order date on a weekend? Yes/No"; as well as more complex abstract features, such as the similarity score between two movies. Thinking up features is as much an art as a science and can rely on domain knowledge. [anderson] See also feature.

GATE
General Architecture for Text Engineering, an open source, Java-based framework for natural language processing tasks.
The framework lets you pipeline other tools designed to be plugged into it. The project is based at the UK's University of Sheffield. See also computational linguistics, UIMA.

gradient boosting
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. [wikipediagb]

gradient descent
An optimization algorithm for finding the input to a function that produces the largest (or smallest) possible value. One approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point. Similarly, you can try to minimize a function by taking small steps in the opposite direction. [grus] See also backpropagation.

JavaScript
A scripting language (no relation to Java) originally designed in the mid-1990s for embedding logic in web pages, but which later evolved into a more general-purpose development language. JavaScript continues to be very popular for embedding logic in web pages, with many libraries available to enhance the operation and visual presentation of these pages. See also AngularJS, D3.

k-means clustering
A data mining algorithm to cluster, classify, or group your N objects based on their attributes or features into K number of groups (so-called clusters). [parsian] See also clustering.

k-nearest neighbors
Also, kNN. A machine learning algorithm that classifies things based on their similarity to nearby neighbors. You tune the algorithm's execution by picking how many neighbors to examine (k) as well as some notion of distance to indicate how near the neighbors are.
For example, in a social network, a friend of your friend could be considered twice the distance away from you as your friend. Similarity would be comparison of feature values in the neighbors being compared. See also classification, feature.

latent variable
In statistics, latent variables (from Latin: present participle of lateo, "lie hidden," as opposed to observable variables) are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. [wikipedialv]

lift
Lift compares the frequency of an observed pattern with how often you'd expect to see that pattern just by chance. If the lift is near 1, then there's a good chance that the pattern you observed is occurring just by chance. The larger the lift, the more likely that the pattern is real. [zumel]

linear algebra
A branch of mathematics dealing with vector spaces and operations on them such as addition and multiplication. Linear algebra is designed to represent systems of linear equations. Linear equations are designed to represent linear relationships, where one entity is written to be a sum of multiples of other entities. In the shorthand of linear algebra, a linear relationship is represented as a linear operator: a matrix. [zheng] See also vector, vector space, matrix, coefficient.

linear regression
A technique to look for a linear relationship (that is, one where the relationship between two varying amounts, such as price and sales, can be expressed with an equation that you can represent as a straight line on a graph) by starting with a set of data points that don't necessarily line up nicely. This is done by computing the "least squares" line: the one that has, on an x-y graph, the smallest possible sum of squared distances to the actual data point y values. Statistical software packages offer automated ways to calculate this. See also regression, logistic regression.
logarithm
If y = 10^x, then log(y) = x. Working with the log of one or more of a model's variables, instead of their original values, can make it easier to model relationships with linear functions instead of non-linear ones. Linear functions are typically easier to use in data analysis. (The log(y) = x example shown is for log base 10; natural logarithms, or log base e, where e is a specific irrational number a little over 2.7, are a bit more complicated but also very useful for related tasks.) See also dependent variable, linear regression.

logistic regression
A model similar to linear regression but where the potential results are a specific set of categories instead of being continuous. See also continuous variable, regression, linear regression.

machine learning
The use of data-driven algorithms that perform better as they have more data to work with, "learning" (that is, refining their models) from this additional data. This often involves cross-validation with training and test data sets. The fundamental goal of machine learning is to generalize beyond the examples in the training set. [domingos] Studying the practical application of machine learning usually means researching which machine learning algorithms are best for which situations. See also algorithm, cross-validation, artificial intelligence.

Markov Chain
An algorithm for working with a series of events (for example, a system being in particular states) to predict the possibility of a certain event based on which other events have happened. The identification of probabilistic relationships between the different events means that Markov Chains and Bayesian networks often come up in the same discussions. See also Bayesian network, Monte Carlo method.

MATLAB
A commercial computer language and environment popular for visualization and algorithm development.
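A Markov Chain like the one in the entry above can be simulated directly. The weather states and transition probabilities below are invented for illustration:

```python
import random

# A toy two-state Markov Chain: tomorrow's weather depends only on today's.
transitions = {"sunny": {"sunny": 0.8, "rainy": 0.2},
               "rainy": {"sunny": 0.4, "rainy": 0.6}}

def step(state, rng):
    """Pick the next state with the probabilities in the transition table."""
    r = rng.random()
    cumulative = 0.0
    for next_state, p in transitions[state].items():
        cumulative += p
        if r < cumulative:
            return next_state
    return next_state  # guard against floating-point round-off

rng = random.Random(0)  # seeded so the run is repeatable
state = "sunny"
sunny_days = 0
n = 10_000
for _ in range(n):
    state = step(state, rng)
    if state == "sunny":
        sunny_days += 1

fraction = sunny_days / n
print(fraction)  # hovers near 2/3, the chain's long-run share of sunny days
```

Sampling from a chain like this with random numbers is also the simplest illustration of why Markov Chains and the Monte Carlo method come up together as MCMC.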
matrix
(Plural: matrices.) An older Webster's dictionary with a heavier emphasis on typographical representation gives the mathematical definition as a set of numbers or terms arranged in rows and columns between parentheses or double lines. [websters] For purposes of manipulating a matrix with software, think of it as a two-dimensional array. As with its one-dimensional equivalent, a vector, this mathematical representation of the two-dimensional array makes it easier to take advantage of software libraries that apply advanced mathematical operations to the data, including libraries that can distribute the processing across multiple processors for scalability. See also vector, linear algebra.

mean
The average value, although technically that is known as the arithmetic mean. (Other means include the geometric and harmonic means.) See also median, mode.

Mean Absolute Error
The average of the absolute values of the errors found when comparing predicted values with observed values. See also Mean Squared Error, Root Mean Squared Error.

Mean Squared Error
Also, MSE. The average of the squares of all the errors found when comparing predicted values with observed values. Squaring them makes the bigger errors count for more, making Mean Squared Error more popular than Mean Absolute Error when quantifying the success of a set of predictions. See also Mean Absolute Error, Root Mean Squared Error.

median
When values are sorted, the value in the middle, or the average of the two in the middle if there are an even number of values. See also mean, mode.

mode
The value that occurs most often in a sample of data. Like the median, the mode cannot be directly calculated [stanton], although it's easy enough to find with a little scripting. For people who work with statistics, "mode" can also mean data type: for example, whether a value is an integer, a real number, or a date. See also mean, median, scripting.

model
A specification of a mathematical (or probabilistic) relationship that exists between different variables. [grus] Because "modeling" can mean so many things, the term "statistical modeling" is often used to more accurately describe the kind of modeling that data scientists do.
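The Mean Absolute Error and Mean Squared Error entries above can be demonstrated with a short script; the predicted and observed values are made up:

```python
def mean(xs):
    return sum(xs) / len(xs)

def mae(predicted, observed):
    """Mean Absolute Error: average of the absolute errors."""
    return mean([abs(p - o) for p, o in zip(predicted, observed)])

def mse(predicted, observed):
    """Mean Squared Error: average of the squared errors."""
    return mean([(p - o) ** 2 for p, o in zip(predicted, observed)])

predicted = [3.0, 5.0, 7.0]
observed  = [2.0, 5.0, 10.0]
print(mae(predicted, observed))  # (1 + 0 + 3) / 3, about 1.33
print(mse(predicted, observed))  # (1 + 0 + 9) / 3, about 3.33
```

Note how the single 3-unit error contributes 9 of MSE's total of 10 but only 3 of MAE's total of 4, which is the "bigger errors count for more" behavior the entry describes.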
Monte Carlo method
In general, the use of randomly generated numbers as part of an algorithm. Its use with Markov Chains is so popular that people usually refer to the combination with the acronym MCMC. See also Markov Chain.

moving average
The mean (or average) of time series data (observations equally spaced in time, such as per hour or per day) from several consecutive periods is called the moving average. It is called moving because the average is continually recomputed as new time series data becomes available, and it progresses by dropping the earliest value and adding the most recent. [parsian] See also mean, time series data.

n-gram
The analysis of sequences of n items (typically, words in natural language) to look for patterns. For example, trigram analysis examines three-word phrases in the input to look for patterns such as which pairs of words appear most often in the groups of three. The value of n can be something other than three, depending on your needs. This helps to construct statistical models of documents (for example, when automatically classifying them) and to find positive or negative terms associated with a product name. See also computational linguistics, classification.

naive Bayes classifier
A collection of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a family of algorithms that all share a common principle: that every feature being classified is independent of the value of any other feature. So, for example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features (red, round, 3" in diameter) to contribute independently to the probability that the fruit is an apple, regardless of any correlations between features. Features, however, aren't always independent, which is often seen as a shortcoming of the naive Bayes algorithm, and this is why it's labeled "naive."
[aylien] This naiveté makes it much easier to develop implementations of these algorithms that scale way up. See also Bayes' Theorem, classification.

neural network
Also, neural net, or artificial neural network to distinguish it from the brain, upon which this algorithm is modeled. A robust function that takes an arbitrary set of inputs and fits it to an arbitrary set of outputs that are binary. In practice, neural networks are used in deep learning research to match images to features and much more. What makes neural networks special is their use of a hidden layer of weighted functions called neurons, with which you can effectively build a network that maps a lot of other functions. Without a hidden layer of functions, neural networks would be just a set of simple weighted functions. [kirk] See also deep learning, backpropagation, perceptron.

normal distribution
Also, Gaussian distribution. (Carl Friedrich Gauss was an early nineteenth-century German mathematician.) A probability distribution which, when graphed, is a symmetrical bell curve with the mean value at the center. The standard deviation value affects the height and width of the graph. See also mean, probability distribution, standard deviation, binomial distribution, standard normal distribution.

NoSQL
A database management system that uses any of several alternatives to the relational, table-oriented model used by SQL databases. While this term originally meant "not SQL," it has come to mean something closer to "not only SQL," because the specialized nature of NoSQL database management systems often has them playing specific roles in a larger system that may also include SQL and additional NoSQL systems. See also SQL.

null hypothesis
If your proposed model for a data set says that the value of x is affecting the value of y, then the null hypothesis (the model you're comparing your proposed model with to check whether x really is affecting y) says that the observations are all based on chance and that there is no effect.
The smaller the P value computed from the sample data, the stronger the evidence is against the null hypothesis. [shin] See also P value.

objective function
When you want to get as much (or as little) of something as possible, and the way you'll get it is by changing the values of other quantities, you have an optimization problem. To solve an optimization problem, you need to combine your decision variables, constraints, and the thing you want to maximize together into an objective function. The objective is the thing you want to maximize or minimize, and you use the objective function to find the optimum result. [milton] See also gradient descent.

outlier
Extreme values that might be errors in measurement and recording, or might be accurate reports of rare events. [downey] See also overfitting.

overfitting
A model of training data that, by taking too many of the data's quirks and outliers into account, is overly complicated and will not be as useful as it could be to find patterns in test data. See also outlier, cross-validation.

P value
Also, p-value. The probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. [goodman] It's a measure of how surprised you should be if there is no actual difference between the groups, but you got data suggesting there is. A bigger difference, or one backed up by more data, suggests more surprise and a smaller p value. The p value is a measure of surprise, not a measure of the size of the effect. [reinhart] A lower p value means that your results are more statistically significant. See also null hypothesis.

PageRank
An algorithm that determines the importance of something, typically to rank it in a list of search results. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
[googlearchive] PageRank is not named for the pages that it ranks but for its inventor, Google co-founder and CEO Larry Page.

Pandas
A Python library for data manipulation popular with data scientists. See also Python.

perceptron
Pretty much the simplest neural network is the perceptron, which approximates a single neuron with n binary inputs. It computes a weighted sum of its inputs and "fires" if that weighted sum is zero or greater. [grus] See also neural network.

Perl
An older scripting language with roots in pre-Linux UNIX systems. Perl has always been popular for text processing, especially data cleanup and enhancement tasks. See also scripting, data wrangling.

pivot table
Pivot tables quickly summarize long lists of data, without requiring you to write a single formula or copy a single cell. But the most notable feature of pivot tables is that you can arrange them dynamically. Say you create a pivot table summary using raw census data. With the drag of a mouse, you can easily rearrange the pivot table so that it summarizes the data based on gender or age groupings or geographic location. The process of rearranging your table is known as pivoting your data: you're turning the same information around to examine it from different angles. [macdonald]

Poisson distribution
A distribution of independent events, usually over a period of time or space, used to help predict the probability of an event. Like the binomial distribution, this is a discrete distribution. Named for early nineteenth-century French mathematician Siméon Denis Poisson. See also spatiotemporal data, discrete variable, binomial distribution.

predictive analytics
The analysis of data to predict future events, typically to aid in business planning. This incorporates predictive modeling and other techniques. Machine learning might be considered a set of algorithms to help implement predictive analytics. The more business-oriented spin of predictive analytics makes it a popular buzz phrase in marketing literature. See also predictive modeling, machine learning, SPSS.
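The perceptron entry above translates almost line for line into code. The AND-gate weights below are a standard hand-picked illustration, not taken from the text:

```python
def perceptron(weights, bias, inputs):
    """Fires (returns 1) if the weighted sum of the inputs plus the
    bias term is zero or greater; otherwise returns 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

# With weights [2, 2] and bias -3, the perceptron behaves as an AND gate:
# it fires only when both binary inputs are 1.
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, perceptron([2, 2], -3, pair))
```

In a full neural network the weights would be learned (for example, via backpropagation) rather than picked by hand, and many such units would be wired into hidden layers.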
principal component analysis
This algorithm simply looks at the direction with the most variance and then determines that as the first principal component. This is very similar to how regression works in that it determines the best direction to map data to. [kirk] See also regression.

prior distribution
In Bayesian inference, we assume that the unknown quantity to be estimated has many plausible values modeled by what's called a prior distribution. Bayesian inference is then using data (that is considered as unchanging) to build a tighter posterior distribution for the unknown quantity. [zumel] See also Bayes' Theorem.

probability distribution
A probability distribution for a discrete random variable is a listing of all possible distinct outcomes and their probabilities of occurring. Because all possible outcomes are listed, the sum of the probabilities must add to 1.0. [levine] See also discrete variable.

Python
A programming language available since 1994 that is popular with people doing data science. Python is noted for ease of use among beginners and great power when used by advanced users, especially when taking advantage of specialized libraries such as those designed for machine learning and graph generation. See also scripting, Pandas.

quantile, quartile
When you divide a set of sorted values into groups that each have the same number of values (for example, if you divide the values into two groups at the median), each group is known as a quantile. If there are four groups, we call them quartiles, which is a common way to divide values for discussion and analysis purposes; if there are five, we call them quintiles, and so forth. See also median.

R
An open-source programming language and environment for statistical computing and graph generation, available for Linux, Windows, and Mac.

random forest
An algorithm used for regression or classification that uses a collection of tree data structures. To classify a new object from an input vector, put the input vector down each of the trees in the forest.
Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). [breiman] The term "random forest" is actually trademarked by its authors. See also classification, vector, decision trees.

regression
The more general problem of fitting any kind of model to any kind of data. This use of the term "regression" is a historical accident; it is only indirectly related to the original meaning of the word. [downey] See also linear regression, logistic regression, principal component analysis.

reinforcement learning
A class of machine learning algorithms in which the process is not given specific goals to meet but, as it makes decisions, is instead given indications of whether it's doing well or not. For example, an algorithm for learning to play a video game knows that if its score just went up, it must have done something right. See also supervised learning, unsupervised learning.

Root Mean Squared Error
Also, RMSE. The square root of the Mean Squared Error. This is more popular than Mean Squared Error because taking the square root of a figure built from the squares of the observation value errors gives a number that's easier to understand in the units used to measure the original observations. See also Mean Absolute Error, Mean Squared Error.

Ruby
A scripting language that first appeared in 1996. Ruby is popular in the data science community, but not as popular as Python, which has more specialized libraries available for data science tasks. See also scripting, Python.

S curve
Imagine a graph showing, for each month since smartphones originally became available, how many people in the US bought their first one. The line would rise slowly at first, when only the early adopters got them, then quickly as these phones became more popular, and then level off again once nearly everyone had one. This graph's line would form a stretched-out S shape.
The S curve applies to many other phenomena and is often mentioned when someone predicts that a rising value will eventually level off. A commercial statistical software suite that includes a programming language also known as SAS. Designating or of a quantity that has magnitude but no direction in space, as volume or temperature n. a scalar quantity: distinguished from vector websters See also vector. Generally, the use of a computer language where your program, or script, can be run directly with no need to first compile it to binary code as with languages such as Java and C. Scripting languages often have simpler syntax than compiled languages, so the process of writing, running, and tweaking scripts can go faster. See also Python. Perl. Ruby. shell. As prices vary from day to day, you might expect to see patterns. If the price is high on Monday, you might expect it to be high for a few more days; and if it's low, you might expect it to stay low. A pattern like this is called serial correlation, because each value is correlated with the next one in the series. To compute serial correlation, we can shift the time series by an interval called a lag, and then compute the correlation of the shifted series with the original. Autocorrelation is another name for serial correlation, used more often when the lag is not 1. downey See also correlation. When you use a computer's operating system from the command line, you're using its shell. Along with scripting languages such as Perl and Python, Linux-based shell tools (which are either included with or easily available for Mac and Windows machines) such as grep, diff, split, comm, head, and tail are popular for data wrangling. A series of shell commands stored in a file that lets you execute the series by entering the file's name is known as a shell script. See also data wrangling. scripting. Perl. Python. Time series data that also includes geographic identifiers such as latitude-longitude pairs.
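The shift-and-correlate recipe for serial correlation described above can be sketched in Python; the helper names are my own, not from any particular library:

```python
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation coefficient of two equal-length series
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

def serial_correlation(series, lag=1):
    # Correlate the series with a copy of itself shifted by `lag`
    return pearson(series[:-lag], series[lag:])
```

A steadily rising series is perfectly serially correlated, since each value predicts the next exactly.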
See also time series data. A commercial statistical software package, or according to the product home page, predictive analytics software. spss The product has always been popular in the social sciences. The company, founded in 1968, was acquired by IBM in 2009. See also predictive analytics. The ISO standard query language for relational databases. Variations of this extremely popular language are often available for data storage systems that aren't strictly relational; watch for the phrase SQL-like. The square root of the variance, and a common way to indicate just how different a particular measurement is from the mean. An observation more than three standard deviations away from the mean can be considered quite rare, in most applications. zumel Statistical software packages offer automated ways to calculate the standard deviation. See also variance. standard normal distribution A normal distribution with a mean of 0 and a standard deviation of 1. When graphed, it's a bell-shaped curve centered around the y axis, where x = 0. See also normal distribution. mean. standard deviation. Also, standard score . normal score . z-score . Transforms a raw score into units of standard deviation above or below the mean. This translates the scores so they can be evaluated in reference to the standard normal distribution. boslaugh Translating two different test sets to use standardized scores makes them easier to compare. See also standard deviation. mean. standard normal distribution. A commercial statistical software package, not to be confused with strata. See also strata, stratified sampling. strata, stratified sampling Divide the population units into homogeneous groups (strata) and draw a simple random sample from each group. gonick Strata also refers to an O'Reilly conference on big data, data science, and related technologies. See also Stata. A type of machine learning algorithm in which a system is taught to classify input into specific, known classes.
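The standardized-score transformation can be sketched in Python; the raw scores below are invented for illustration:

```python
from statistics import mean, stdev

def z_scores(values):
    # Express each raw score in units of standard deviation
    # above or below the mean.
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

raw = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical test scores
standardized = z_scores(raw)
```

By construction the standardized scores have mean 0 and standard deviation 1, which is what lets two different test sets be compared on the same footing.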
The classic example is sorting email into spam versus ham. See also unsupervised learning. reinforcement learning. machine learning. support vector machine Also, SVM . Imagine that you want to write a function that draws a line on a two-dimensional x - y graph that separates two different kinds of points (that is, it classifies them into two categories) but you can't, because on that graph they're too mixed together. Now imagine that the points are in three dimensions, and you can classify them by writing a function that describes a plane that can be positioned at any angle and position in those three dimensions, giving you more opportunities to find a working mathematical classifier. This plane that is one dimension less than the space around it, such as a two-dimensional plane in a three-dimensional space or a one-dimensional line in a two-dimensional space, is known as a hyperplane. A support vector machine is a supervised learning classification tool that seeks a dividing hyperplane for any number of dimensions. (Keep in mind that dimensions don't have to be x, y, and z position coordinates, but any features you choose to drive the categorization.) SVMs have been used for regression tasks as well as categorization tasks. See also supervised learning. feature. Also, Student's t distribution . A variation on normal distribution that accounts for the fact that you're only using a sampling of all the possible values instead of all of them. Invented by Guinness Brewery statistician William Gosset (publishing under the pseudonym Student) in the early 20th century for his quality assurance work there. See also normal distribution. A commercial data visualization package often used in data science projects. time series data Strictly speaking, a time series is a sequence of measurements of some quantity taken at different times, often but not necessarily at equally spaced intervals.
boslaugh So, time series data will have measurements of observations (for example, air pressure or stock prices) accompanied by date-time stamps. See also spatiotemporal data. moving average. The Unstructured Information Management Architecture was developed at IBM as a framework to analyze unstructured information, especially natural language. OASIS UIMA is a specification that standardizes this framework and Apache UIMA is an open-source implementation of it. The framework lets you pipeline other tools designed to be plugged into it. See also computational linguistics. GATE. A class of machine learning algorithms designed to identify groupings of data without knowing in advance what the groups will be. See also supervised learning. reinforcement learning. clustering . How much a list of numbers varies from the mean (average) value. It is frequently used in statistics to measure how large the differences are in a set of numbers. It is calculated by averaging the squared difference of every number from the mean. segaran Any statistical package will offer an automated way to calculate this. See also mean. bias. standard deviation. Webster's first mathematical definition is a mathematical expression denoting a combination of magnitude and direction, which you may remember from geometry class, but their third definition is closer to how data scientists use the term: an ordered set of real numbers, each denoting a distance on a coordinate axis websters. These numbers may represent a series of details about a single person, movie, product, or whatever entity is being modeled. This mathematical representation of the set of values makes it easier to take advantage of software libraries that apply advanced mathematical operations to the data. See also matrix. linear algebra. An open-source set of command line and graphical user interface data analysis tools developed at the University of Waikato in New Zealand. References Sarah Boslaugh, Statistics in a Nutshell .
2nd Edition (Sebastopol: O'Reilly Media, 2012). David M. Bourg and Glenn Seemann, AI for Game Developers (Sebastopol: O'Reilly Media, 2004). Leo Breiman and Adele Cutler, Random Forests, accessed 2015-08-22. Allen B. Downey, Think Stats . 2nd Edition (Sebastopol: O'Reilly Media, 2014). Larry Gonick and Woolcott Smith, The Cartoon Guide to Statistics (New York: HarperCollins, 1993). S. N. Goodman, Toward evidence-based medical statistics. 1: The P value fallacy . Annals of Internal Medicine, 130:995–1004, 1999. (quoted in Reinhart ) Mahmoud Parsian, Data Algorithms (Sebastopol: O'Reilly Media, 2015), 82. Stanton, J. M. (2012). Introduction to Data Science . Third Edition. iTunes Open Source eBook. Available: itunes. appleusbookintroduction-to-data-scienceid529088127mt11 Victoria Neufeldt, Editor in Chief, Webster's New World College Dictionary . Third Edition (New York: Macmillan, 1997). Nina Zumel and John Mount, Practical Data Science with R (Shelter Island: Manning Publications, 2014). Eva Goldwater, Biostatistics Consulting Center, University of Massachusetts School of Public Health, updated February 2007. At A Glance We used Excel to do some basic data analysis tasks to see whether it is a reasonable alternative to using a statistical package for the same tasks. We concluded that Excel is a poor choice for statistical analysis beyond textbook examples, the simplest descriptive statistics, or for more than a very few columns. The problems we encountered that led to this conclusion are in four general areas: Missing values are handled inconsistently, and sometimes incorrectly. Data organization differs according to analysis, forcing you to reorganize your data in many ways if you want to do many different analyses. Many analyses can only be done on one column at a time, making it inconvenient to do the same analysis on many columns. Output is poorly organized, sometimes inadequately labeled, and there is no record of how an analysis was accomplished.
Excel is convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, when you are ready to do the statistical analysis, we recommend the use of a statistical package such as SAS, SPSS, Stata, Systat, or Minitab. Introduction Excel is probably the most commonly used spreadsheet for PCs. Newly purchased computers often arrive with Excel already loaded. It is easily used to do a variety of calculations, includes a collection of statistical functions, and has a Data Analysis ToolPak. As a result, if you suddenly find you need to do some statistical analysis, you may turn to it as the obvious choice. We decided to do some testing to see how well Excel would serve as a Data Analysis application. To present the results, we will use a small example. The data for this example is fictitious. It was chosen to have two categorical and two continuous variables, so that we could test a variety of basic statistical techniques. Since almost all real data sets have at least a few missing data points, and since the ability to deal with missing data correctly is one of the features that we take for granted in a statistical analysis package, we introduced two empty cells in the data: Each row of the spreadsheet represents a subject. The first subject received Treatment 1, and had Outcome 1. X and Y are the values of two measurements on each subject. We were unable to get a measurement for Y on the second subject, or on X for the last subject, so these cells are blank. The subjects are entered in the order that the data became available, so the data is not ordered in any particular way. We used this data to do some simple analyses and compared the results with a standard statistical package. The comparison considered the accuracy of the results as well as the ease with which the interface could be used for bigger data sets, i.e., more columns.
We used SPSS as the standard, though any of the statistical packages OIT supports would do equally well for this purpose. In this article when we say "statistical package," we mean SPSS, SAS, STATA, SYSTAT, or Minitab. Most of Excel's statistical procedures are part of the Data Analysis ToolPak, which is in the Tools menu. It includes a variety of choices including simple descriptive statistics, t-tests, correlations, 1 or 2-way analysis of variance, regression, etc. If you do not have a Data Analysis item on the Tools menu, you need to install the Data Analysis ToolPak. Search in Help for "Data Analysis Tools" for instructions on loading the ToolPak. Two other Excel features are useful for certain analyses, but the Data Analysis ToolPak is the only one that provides reasonably complete tests of statistical significance. Pivot Table in the Data menu can be used to generate summary tables of means, standard deviations, counts, etc. Also, you could use functions to generate some statistical measures, such as a correlation coefficient. Functions generate a single number, so using functions you will likely have to combine bits and pieces to get what you want. Even so, you may not be able to generate all the parts you need for a complete analysis. Unless otherwise stated, all statistical tests using Excel were done with the Data Analysis ToolPak. In order to check a variety of statistical tests, we chose the following tasks: Get means and standard deviations of X and Y for the entire group, and for each treatment group. Get the correlation between X and Y. Do a two-sample t-test to test whether the two treatment groups differ on X and Y. Do a paired t-test to test whether X and Y are statistically different from each other. Compare the number of subjects with each outcome by treatment group, using a chi-squared test.
All of these tasks are routine for a data set of this nature, and all of them could be easily done using any of the above-listed statistical packages. General Issues Enable the Analysis ToolPak The Data Analysis ToolPak is not installed with the standard Excel setup. Look in the Tools menu. If you do not have a Data Analysis item, you will need to install the Data Analysis tools. Search Help for "Data Analysis Tools" for instructions. Missing Values A blank cell is the only way for Excel to deal with missing data. If you have any other missing value codes, you will need to change them to blanks. Data Arrangement Different analyses require the data to be arranged in various ways. If you plan on a variety of different tests, there may not be a single arrangement that will work. You will probably need to rearrange the data several ways to get everything you need. Dialog Boxes Choose Tools > Data Analysis, and select the kind of analysis you want to do. The typical dialog box will have the following items: Input Range: Type the upper left and lower right corner cells, e.g., A1:B100. You can only choose adjacent rows and columns. Unless there is a checkbox for grouping data by rows or columns (and there usually is not), all the data is considered as one glop. Labels - There is sometimes a box you can check off to indicate that the first row of your sheet contains labels. If you have labels in the first row, check this box, and your output MAY be labeled with your label. Then again, it may not. Output location - New Sheet is the default. Or, type in the cell address of the upper left corner of where you want to place the output in the current sheet. New Worksheet is another option, which I have not tried. Ramifications of this choice are discussed below. Other items, depending on the analysis.
Output location The output from each analysis can go to a new sheet within your current Excel file (this is the default), or you can place it within the current sheet by specifying the upper left corner cell where you want it placed. Either way is a bit of a nuisance. If each output is in a new sheet, you end up with lots of sheets, each with a small bit of output. If you place them in the current sheet, you need to place them appropriately; leave room for adding comments and labels; changes you need to make to format one output properly may affect another output adversely. Example: Output from Descriptives has a column of labels such as Standard Deviation, Standard Error, etc. You will want to make this column wide in order to be able to read the labels. But if a simple Frequency output is right underneath, then the column displaying the values being counted, which may just contain small integers, will also be wide. Results of Analyses Descriptive Statistics The quickest way to get means and standard deviations for an entire group is using Descriptives in the Data Analysis tools. You can choose several adjacent columns for the Input Range (in this case the X and Y columns), and each column is analyzed separately. The labels in the first row are used to label the output, and the empty cells are ignored. If you have more non-adjacent columns you need to analyze, you will have to repeat the process for each group of contiguous columns. The procedure is straightforward, can manage many columns reasonably efficiently, and empty cells are treated properly. To get the means and standard deviations of X and Y for each treatment group requires the use of Pivot Tables (unless you want to rearrange the data sheet to separate the two groups). After selecting the (contiguous) data range, in the Pivot Table Wizard's Layout option, drag Treatment to the Row variable area, and X to the Data area. Double-click on "Count of X" in the Data area, and change it to Average.
Drag X into the Data box again, and this time change Count to StdDev. Finally, drag X in one more time, leaving it as Count of X. This will give us the average, standard deviation, and number of observations in each treatment group for X. Do the same for Y, so we will get the average, standard deviation, and number of observations for Y also. This will put a total of six items in the Data box (three for X and three for Y). As you can see, if you want to get a variety of descriptive statistics for several variables, the process will get tedious. A statistical package lets you choose as many variables as you wish for descriptive statistics, whether or not they are contiguous. You can get the descriptive statistics for all the subjects together, or broken down by a categorical variable such as treatment. You can select the statistics you want to see once, and it will apply to all variables chosen. Correlations Using the Data Analysis tools, the dialog for correlations is much like the one for descriptives - you can choose several contiguous columns, and get an output matrix of all pairs of correlations. Empty cells are ignored appropriately. The output does NOT include the number of pairs of data points used to compute each correlation (which can vary, depending on where you have missing data), and does not indicate whether any of the correlations are statistically significant. If you want correlations on non-contiguous columns, you would either have to include the intervening columns, or copy the desired columns to a contiguous location. A statistical package would permit you to choose non-contiguous columns for your correlations. The output would tell you how many pairs of data points were used to compute each correlation, and which correlations are statistically significant. Two-Sample T-test This test can be used to check whether the two treatment groups differ on the values of either X or Y. In order to do the test you need to enter a cell range for each group.
Since the data were not entered by treatment group, we first need to sort the rows by treatment. Be sure to take all the other columns along with treatment, so that the data for each subject remains intact. After the data is sorted, you can enter the range of cells containing the X measurements for each treatment. Do not include the row with the labels, because the second group does not have a label row. Therefore your output will not be labeled to indicate that this output is for X. If you want the output labeled, you have to copy the cells corresponding to the second group to a separate column, and enter a row with a label for the second group. If you also want to do the t-test for the Y measurements, you'll need to repeat the process. The empty cells are ignored, and other than the problems with labeling the output, the results are correct. A statistical package would do this task without any need to sort the data or copy it to another column, and the output would always be properly labeled to the extent that you provide labels for your variables and treatment groups. It would also allow you to choose more than one variable at a time for the t-test (e.g., X and Y). Paired t-test The paired t-test is a method for testing whether the difference between two measurements on the same subject is significantly different from 0. In this example, we wish to test the difference between X and Y measured on the same subject. The important feature of this test is that it compares the measurements within each subject. If you scan the X and Y columns separately, they do not look obviously different. But if you look at each X-Y pair, you will notice that in every case, X is greater than Y. The paired t-test should be sensitive to this difference. In the two cases where either X or Y is missing, it is not possible to compare the two measures on a subject. Hence, only 8 rows are usable for the paired t-test.
When you run the paired t-test on this data, you get a t-statistic of 0.09, with a 2-tail probability of 0.93. The test does not find any significant difference between X and Y. Looking at the output more carefully, we notice that it says there are 9 observations. As noted above, there should only be 8. It appears that Excel has failed to exclude the observations that did not have both X and Y measurements. To get the correct results, copy X and Y to two new columns and remove the data in the cells that have no value for the other measure. Now re-run the paired t-test. This time the t-statistic is 6.14817 with a 2-tail probability of 0.000468. The conclusion is completely different! Of course, this is an extreme example. But the point is that Excel does not calculate the paired t-test correctly when some observations have one of the measurements but not the other. Although it is possible to get the correct result, you would have no reason to suspect the results you get unless you are sufficiently alert to notice that the number of observations is wrong. There is nothing in online help that would warn you about this issue. Interestingly, there is also a TTEST function, which gives the correct results for this example. Apparently the functions and the Data Analysis tools are not consistent in how they deal with missing cells. Nevertheless, I cannot recommend the use of functions in preference to the Data Analysis tools, because the result of using a function is a single number - in this case, the 2-tail probability of the t-statistic. The function does not give you the t-statistic itself, the degrees of freedom, or any number of other items that you would want to see if you were doing a statistical test. A statistical package will correctly exclude the cases with one of the measurements missing, and will provide all the supporting statistics you need to interpret the output.
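The correct handling, pairwise deletion before computing the paired t-statistic, can be sketched in Python; the data values below are invented for illustration, not the article's fictitious data set:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    # Keep only the rows where BOTH measurements are present.
    # This is the step Excel's Data Analysis tool skips.
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    diffs = [x - y for x, y in pairs]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n  # n, not len(xs), is the usable sample size

x = [10.0, None, 12.0, 11.0]  # hypothetical measurements with a missing cell
y = [9.0, 8.0, 10.0, None]
t_stat, n_used = paired_t(x, y)
```

Checking `n_used` against the number of rows you entered is exactly the sanity check that exposes Excel's error in the example above.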
Crosstabulation and Chi-Squared Test of Independence Our final task is to count the two outcomes in each treatment group, and use a chi-square test of independence to test for a relationship between treatment and outcome. In order to count the outcomes by treatment group, you need to use Pivot Tables. In the Pivot Table Wizard's Layout option, drag Treatment to Row, Outcome to Column and also to Data. The Data area should say "Count of Outcome" - if not, double-click on it and select "Count". If you want percents, double-click "Count of Outcome", and click Options; in the "Show Data As" box which appears, select "% of row". If you want both counts and percents, you can drag the same variable into the Data area twice, and use it once for counts and once for percents. Getting the chi-square test is not so simple, however. It is only available as a function, and the input needed for the function is the observed counts in each combination of treatment and outcome (which you have in your pivot table), and the expected counts in each combination. Expected counts? What are they? How do you get them? If you have sufficient statistical background to know how to calculate the expected counts, and can do Excel calculations using relative and absolute cell addresses, you should be able to navigate through this. If not, you're out of luck. Assuming that you surmounted the problem of expected counts, you can use the Chitest function to get the probability of observing a chi-square value bigger than the one for this table. Again, since we are using functions, you do not get many other necessary pieces of the calculation, notably the value of the chi-square statistic or its degrees of freedom. No statistical package would require you to provide the expected values before computing a chi-square test of independence. Further, the results would always include the chi-square statistic and its degrees of freedom, as well as its probability.
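For readers wondering about those expected counts: each cell's expected count is its row total times its column total divided by the grand total. A minimal Python sketch with a made-up 2x2 table:

```python
def expected_counts(observed):
    # observed is a 2-D table of counts (a list of rows)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # expected[i][j] = row_total[i] * col_total[j] / grand_total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    # Sum of (observed - expected)^2 / expected over every cell
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical counts: rows are treatments, columns are outcomes
table = [[10, 20],
         [20, 10]]
stat = chi_square_statistic(table)  # degrees of freedom: (rows-1)*(cols-1)
```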
Often you will get some additional statistics as well. Additional Analyses The remaining analyses were not done on this data set, but some comments about them are included for completeness. Simple Frequencies You can use Pivot Tables to get simple frequencies. (See Crosstabulations for more about how to get Pivot Tables.) Using Pivot Tables, each column is considered a separate variable, and labels in row 1 will appear on the output. You can only do one variable at a time. Another possibility is to use the Frequencies function. The main advantage of this method is that once you have defined the frequencies function for one column, you can use Copy/Paste to get it for other columns. First, you will need to enter a column with the values you want counted (bins). If you intend to do the frequencies for many columns, be sure to enter values for the column with the most categories; e.g., if 3 columns have values of 1 or 2, and the fourth has values of 1, 2, 3, 4, you will need to enter the bin values as 1, 2, 3, 4. Now select enough empty cells in one column to store the results - 4 in this example, even if the current column only has 2 values. Next choose Insert > Function > Statistical > Frequencies on the menu. Fill in the input range for the first column you want to count using relative addresses (e.g., A1:A100). Fill in the Bin Range using the absolute addresses of the locations where you entered the values to be counted (e.g., $M$1:$M$4). Click Finish. Note the box above the column headings of the sheet, where the formula is displayed. It starts with "=FREQUENCIES(". Place the cursor to the left of the = sign in the formula, and press Ctrl-Shift-Enter. The frequency counts now appear in the cells you selected. To get the frequency counts of other columns, select the cells with the frequencies in them, and choose Edit > Copy on the menu.
If the next column you want to count is one column to the right of the previous one, select the cell to the right of the first frequency cell, and choose Edit > Paste (Ctrl-V). Continue moving to the right and pasting for each column you want to count. Each time you move one column to the right of the original frequency cells, the column to be counted is shifted right from the first column you counted. If you want percents as well, you'll have to use the Sum function to compute the sum of the frequencies, and define the formula to get the percent for one cell. Select the cell to store the first percent, and type the formula into the formula box at the top of the sheet - e.g., =N1*100/N$5 - where N1 is the cell with the frequency for the first category, and N5 is the cell with the sum of the frequencies. Use Copy/Paste to get the formula for the remaining cells of the first column. Once you have the percents for one column, you can Copy/Paste them to the other columns. You'll need to be careful about the use of relative and absolute addresses. In the example above, we used N$5 for the denominator, so when we copy the formula down to the next frequency in the same column, it will still look for the sum in row 5; but when we copy the formula right to another column, it will shift to the frequencies in the next column. Finally, you can use Histogram on the Data Analysis menu. You can only do one variable at a time. As with the Frequencies function, you must enter a column with "bin" boundaries. To count the number of occurrences of 1 and 2, you need to enter 0, 1, 2 in three adjacent cells, and give the range of these three cells as the Bins on the dialog box. The output is not labeled with any labels you may have in row 1, nor even with the column letter. If you do frequencies on lots of variables, you will have difficulty knowing which frequency belongs to which column of data.
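For comparison, the same frequency-and-percent table takes only a few lines in a general-purpose language, with no bin column, array formulas, or address gymnastics. A Python sketch with made-up values:

```python
from collections import Counter

column = [1, 2, 2, 1, 3, 2, 1, 1]  # hypothetical column of coded values

freqs = Counter(column)            # maps each value to its count
total = sum(freqs.values())
percents = {value: 100 * count / total for value, count in freqs.items()}
```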
Linear Regression Since regression is one of the more frequently used statistical analyses, we tried it out even though we did not do a regression analysis for this example. The Regression procedure in the Data Analysis tools lets you choose one column as the dependent variable, and a set of contiguous columns for the independents. However, it does not tolerate any empty cells anywhere in the input ranges, and you are limited to 16 independent variables. Therefore, if you have any empty cells, you will need to copy all the columns involved in the regression to new columns, and delete any rows that contain any empty cells. Large models, with more than 16 predictors, cannot be done at all. Analysis of Variance In general, Excel's ANOVA features are limited to a few special cases rarely found outside textbooks, and require lots of data rearrangement. One-way ANOVA Data must be arranged in separate and adjacent columns (or rows) for each group. Clearly, this is not conducive to doing 1-ways on more than one grouping. If you have labels in row 1, the output will use the labels. Two-Factor ANOVA Without Replication This only does the case with one observation per cell (i.e., no Within-Cell error term). The input range is a rectangular arrangement of cells, with rows representing levels of one factor, columns the levels of the other factor, and the cell contents the one value in that cell. Two-Factor ANOVA with Replicates This does a two-way ANOVA with equal cell sizes. Input must be a rectangular region with columns representing the levels of one factor, and rows representing replicates within levels of the other factor. The input range MUST also include an additional row at the top, and column on the left, with labels indicating the factors. However, these labels are not used to label the resulting ANOVA table. Click Help on the ANOVA dialog for a picture of what the input range must look like.
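To make the one-way case concrete: the F statistic such a tool reports is the ratio of the between-group to the within-group mean square. A minimal Python sketch with made-up groups:

```python
from statistics import mean

def one_way_anova_f(groups):
    # groups: one list of measurements per group
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    # F = mean square between / mean square within
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

Note that nothing in this calculation cares how the groups are laid out in memory, which is why statistical packages do not impose Excel's adjacent-columns arrangement.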
Requesting Many Analyses If you had a variety of different statistical procedures that you wanted to perform on your data, you would almost certainly find yourself doing a lot of sorting, rearranging, copying, and pasting of your data. This is because each procedure requires that the data be arranged in a particular way, often different from the way another procedure wants the data arranged. In our small test, we had to sort the rows in order to do the t-test, and copy some cells in order to get labels for the output. We had to clear the contents of some cells in order to get the correct paired t-test, but did not want those cells cleared for some other test. And we were only doing five tasks. It does not get better when you try to do more. There is no single arrangement of the data that would allow you to do many different analyses without making many different copies of the data. The need to manipulate the data in many ways greatly increases the chance of introducing errors. Using a statistical program, the data would normally be arranged with the rows representing the subjects, and the columns representing variables (as they are in our sample data). With this arrangement you can do any of the analyses discussed here, and many others as well, without having to sort or rearrange your data in any way. Only much more complex analyses, beyond the capabilities of Excel and the scope of this article, would require data rearrangement. Working with Many Columns What if your data had not 4, but 40 columns, with a mix of categorical and continuous measures? How easily do the above procedures scale to a larger problem? At best, some of the statistical procedures can accept multiple contiguous columns for input, and interpret each column as a different measure. The descriptives and correlations procedures are of this type, so you can request descriptive statistics or correlations for a large number of continuous variables, as long as they are entered in adjacent columns.
If they are not adjacent, you need to rearrange columns or use copy and paste to make them adjacent. Many procedures, however, can only be applied to one column at a time. T-tests (either independent or paired), simple frequency counts, the chi-square test of independence, and many other procedures are in this class. This becomes a serious drawback if you have more than a handful of columns, even if you use cut and paste or macros to reduce the work. In addition to having to repeat the request many times, you have to decide where to store the results of each, and make sure each output is properly labeled so you can easily locate and identify it. Finally, Excel does not give you a log or other record to track what you have done. This can be a serious drawback if you want to be able to repeat the same (or a similar) analysis in the future, or even if you have simply forgotten what you have already done.

Using a statistical package, you can request a test for as many variables as you need at once. Each one will be properly labeled and arranged in the output, so there is no confusion as to what's what. You can also expect to get a log, and often a set of commands as well, which can be used to document your work or to repeat an analysis without having to go through all the steps again.

Although Excel is a fine spreadsheet, it is not a statistical data analysis package. In all fairness, it was never intended to be one. Keep in mind that the Data Analysis ToolPak is an "add-in": an extra feature that enables you to do a few quick calculations. So it should not be surprising that that is just what it is good for: a few quick calculations. If you attempt to use it for more extensive analyses, you will encounter difficulties due to any or all of the following limitations:

Potential problems with analyses involving missing data. These can be insidious, in that the unwary user is unlikely to realize that anything is wrong.
Lack of flexibility in the analyses that can be done, due to its expectations regarding the arrangement of data. This results in the need to cut, paste, sort, and otherwise rearrange the data sheet in various ways, increasing the likelihood of errors.

Output scattered across many different worksheets, or all over one worksheet, which you must take responsibility for arranging in a sensible way.

Output that may be incomplete or improperly labeled, increasing the possibility of misidentifying results.

The need to repeat requests for the same analysis multiple times in order to run it for multiple variables, or to request multiple options.

The need to do some things by defining your own functions/formulae, with the attendant risk of errors.

No record of what you did to generate your results, making it difficult to document your analysis or to repeat it at a later time, should that be necessary.

If you have more than about 10 or 12 columns, and/or want to do anything beyond descriptive statistics and perhaps correlations, you should be using a statistical package. There are several suitable ones available by site license through OIT, or you can use them in any of the OIT PC labs. If you have Excel on your own PC and don't want to pay for a statistical program, by all means use Excel to enter the data (with rows representing the subjects and columns for the variables). All the mentioned statistical packages can read Excel files, so you can do the (time-consuming) data entry at home and go to the labs to do the analysis.

A much more extensive discussion of the pitfalls of using Excel, with many additional links, is available at burns-stat. Click on Tutorials, then Spreadsheet Addiction.

For assistance or more information about statistical software, contact the Biostatistics Consulting Center. Telephone: 545-2949
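As a closing sketch, here is what the layout recommended above (rows representing subjects, columns representing variables) buys you: with the data held in that one shape, different analyses simply select the columns they need, with no sorting, clearing, or copying. The group names and measurements below are invented for illustration, written in Python using only the standard library.

```python
# Sketch: "rows = subjects, columns = variables" layout.
# Two different analyses run from the same untouched data --
# no rearrangement between them. Values are hypothetical.
from statistics import mean

subjects = [
    {"group": "treated", "pre": 10.0, "post": 14.0},
    {"group": "treated", "pre": 12.0, "post": 15.0},
    {"group": "control", "pre": 11.0, "post": 11.5},
    {"group": "control", "pre": 9.0,  "post": 10.0},
]

# Analysis 1: mean "post" score by group (no sorting by group needed)
groups = {}
for row in subjects:
    groups.setdefault(row["group"], []).append(row["post"])
group_means = {g: mean(vals) for g, vals in groups.items()}

# Analysis 2: paired pre/post differences, from the very same layout
# (the kind of input a paired t-test starts from)
diffs = [row["post"] - row["pre"] for row in subjects]
mean_diff = mean(diffs)
```

Neither analysis required clearing cells or making a second copy of the data, which is precisely the convenience the article attributes to statistical packages.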