Friday, December 13, 2019

Data mining Free Essays

Objective

There are many websites and newspapers giving predictions for dog races, but there is no tool that gives a mathematical analysis of the races. For my Data Mining Project I will use a database collected from www.greyhound-data.com, then I will use this data in RapidMiner to generate a random race sample, and finally I will predict the winner of the race using the same tool.

Database

The database collected comprises 100 examples with 11 dimensions:

1. Place: the national rank
2. Name: II/II represents the land of standing/land of
3. Land of Birth
4. Land of Standing
5. Year of birth
6. Sex: male or female
7. Sire: father's name
8. Dam: mother's name (the last two dimensions are considered important in gambling)
9. Races: the number of races in 2014
10. Points: how many points each dog has accumulated in 2014
11. Avg Dist: the average distance of races

All the details are based on 2014 statistics collected from the website mentioned above. On top of these dimensions I manually added three more:

1. Weight: in kg
2. Owner
3. Color

The last three have missing data, which makes the dataset noisy, but I will try to find the best way to recover the missing data. After importing the dataset into RapidMiner from an Excel file, I first analysed the data, then separated clean data from dirty data (no_missing_attributes function). As a result, only 29 items were complete, while 71 had missing values (noisy). As we can see in the picture, the missing values are highlighted in red.

Removing Noise

The first method used to remove the noise is the "average" function provided by RapidMiner. A graphical representation of the design of this method can be seen in the next picture. With this method I replaced all missing values with the average.
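The cleaning step described above (splitting complete rows from rows with gaps, then filling the gaps with the column average) can be sketched outside the tool as well. The following is only a minimal Python illustration, not the RapidMiner process itself; the row layout, the dog names, the values and the helper names `split_clean_dirty` and `impute_with_average` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical rows: None marks a missing weight (kg). These are not
# records from the essay's dataset.
dogs = [
    {"name": "Dog A", "points": 120, "weight_kg": 31.0},
    {"name": "Dog B", "points": 95, "weight_kg": None},
    {"name": "Dog C", "points": 110, "weight_kg": 29.0},
    {"name": "Dog D", "points": 80, "weight_kg": None},
]


def split_clean_dirty(rows):
    """Separate rows with no missing values from rows with at least one."""
    clean = [r for r in rows if all(v is not None for v in r.values())]
    dirty = [r for r in rows if any(v is None for v in r.values())]
    return clean, dirty


def impute_with_average(rows, column):
    """Return copies of the rows with missing values in `column`
    replaced by the mean of the known values in that column."""
    known = [r[column] for r in rows if r[column] is not None]
    avg = mean(known)
    return [{**r, column: avg if r[column] is None else r[column]} for r in rows]


clean, dirty = split_clean_dirty(dogs)
filled = impute_with_average(dogs, "weight_kg")
```

Mean imputation keeps every row usable, but it also pulls the filled values toward the centre of the distribution, so a column with many gaps (71 of 100 rows here) will look less varied than it really is.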
Generate a Sample

The next step is to generate a sample of six items, because this is the number of dogs competing in a race. The sample is randomly generated. As we can see highlighted in red in the result, the national ranks are close, which means that the race will be very tight and very hard to predict. In the last results I noticed some data that I do not need for my final analysis, and I decided to remove it. To do this I used "Remove Useless Attributes", as shown in the next picture. The results are now simpler to read, with only 12 dimensions left.

Phase 3: The Results

In this part I will try to predict which of the six dogs will win the race. I will use two methods: one is the "Aggregate" function and the other is "Attribute Generation". First, I decided to remove some of the attributes, as not all of them are needed for this operation. To do this, I used the "Select Attribute" function, as shown in the picture below. Six attributes will be enough for the next and final operation, which finds the winner.

Next, I will use the "Aggregate" operator with the attribute "Points" to generate the winner. After adding this operator in the design window, one click displays its parameters on the right-hand side. After I clicked "Edit List", a window opened, where I selected the attribute "Points" on the left and the "maximum" function on the right (next picture). Now we can run the process to see the result. Based on "Points", the likely winner is the number one dog on the list, because it has the highest number of points. This result is reasonable, as the accumulated points are the most important deciding factor when we want to identify the "favorite" for a dog race. But because the points are not the only factor to consider, another method has to be found.
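The two steps above, drawing a random race of six dogs and taking the maximum of "Points", can be approximated in a few lines of Python. This is a sketch under assumptions: the 100 records are synthetic, and `draw_race` and `favourite_by_points` are hypothetical helper names rather than RapidMiner operators.

```python
import random

# Synthetic stand-in for the 100-row dataset; names and points are invented.
dogs = [{"name": f"Dog {i}", "points": 50 + (i * 37) % 101} for i in range(1, 101)]


def draw_race(rows, size=6, seed=None):
    """Pick `size` distinct dogs at random (six dogs run in one race)."""
    rng = random.Random(seed)
    return rng.sample(rows, size)


def favourite_by_points(race):
    """Equivalent of aggregating with 'maximum' over the points column."""
    return max(race, key=lambda r: r["points"])


race = draw_race(dogs, seed=2014)
favourite = favourite_by_points(race)
print(favourite["name"], favourite["points"])
```

Passing a seed makes the sample repeatable, which helps when comparing the single-attribute favourite with the result of any other scoring method on the same six dogs.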
Next, I will present another solution, which looks even more interesting. It weights more than one attribute, which is why this method looks better. I removed the "Aggregate" operator and added two others instead: "Set Role" and "Generate Attribute". I used the Set Role operator to generate a label (picture below, on the right); in this case I chose the name. The next picture describes the Generate Attribute operator. I clicked "Edit List" (number 1) on the right-hand side and a new window opened. In this window, new attributes can be generated. At number 2 the new attribute name is defined, which is "Winner" in my case; then at number 3 a formula is entered. The formula weights three attributes: "Weight", "Races" and "Distance". Based on them, RapidMiner calculates a score for each dog. The results are shown in the next picture. The winner, number one, Austrian Lisa, is highlighted in red, and the newly generated attribute "Winner", which shows the results for all the competitors, is in black.

Conclusions

This model can be used by betting companies, like Powdery for example, to generate odds, but it can also be used by people who have a passion for gambling. It can also be used to build a website that calculates the winners of future races and attracts visitors this way.

Data Mining Free Essays

Determine the benefits of data mining to the businesses when employing:

1. Predictive analytics to understand the behavior of customers

Predictive analytics is a business intelligence technology that produces a predictive score for each customer or other organizational element. Assigning these predictive scores is the job of a predictive model, which has, in turn, been trained over your data, learning from the experience of your organization.
Predictive analytics optimizes marketing campaigns and website behavior to increase customer responses, conversions and clicks, and to decrease churn. Each customer's predictive score informs actions to be taken with that customer.

1. Associations discovery in products sold to customers

The way in which companies interact with their customers has changed dramatically over the past few years. A customer's continuing business is no longer guaranteed. As a result, companies have found that they need to understand their customers better and to respond quickly to their wants and needs. In addition, the time frame in which these responses need to be made has been shrinking. It is no longer possible to wait until the signs of customer dissatisfaction are obvious before taking action. To succeed, companies must be proactive and anticipate what a customer desires. For example, in the old days, storekeepers would simply keep track of all of their customers in their heads, and would know what to do when a customer walked into the store. Today's store associates face a much more complex situation: more customers, more products, more competitors, and less time to react mean that understanding your customers is now much harder. A number of forces are working together to increase the complexity of customer relationships, such as compressed marketing cycles, increased marketing costs, and a stream of new product offers.

There are many kinds of models, such as linear formulas and business rules. And, for each kind of model, there are all the weights or rules or other mechanics that determine precisely how the predictors are combined. In fact, there are so many choices that it is practically impossible for a person to try them all and find the best one.
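To make the idea of a "linear formula" model concrete, here is a small Python sketch of how weights combine predictors into one predictive score per customer. Everything in it is an assumption for illustration: the predictor names, the customer values and the hand-picked weights. In practice the weights would be learned from historical data rather than chosen by hand.

```python
def linear_score(customer, weights, bias=0.0):
    """Combine predictors into one score: bias + sum(weight * value)."""
    return bias + sum(weights[name] * customer[name] for name in weights)


# Hypothetical customers and predictors.
customers = [
    {"id": "c1", "visits_last_month": 8, "months_since_purchase": 1},
    {"id": "c2", "visits_last_month": 1, "months_since_purchase": 9},
]

# Hand-picked weights: frequent visits raise the score,
# a long gap since the last purchase lowers it.
weights = {"visits_last_month": 0.5, "months_since_purchase": -0.3}

# Rank customers from highest to lowest score.
ranked = sorted(customers, key=lambda c: linear_score(c, weights), reverse=True)
```

Changing a single weight can reorder the customers, which is why the number of candidate models grows so quickly and why the search for good weights is left to software.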
Predictive analytics is a data mining technology that uses the company's customer data to automatically build a predictive model specialized for the business. This process learns from the organization's collective experience by leveraging the existing logs of customer purchases, behavior and demographics. The wisdom gained is encoded as the predictive model itself. Predictive modeling software has computer science at its core, undertaking a mixture of number crunching and trial and error.

2. Web mining to discover business intelligence from Web customers

Fast business growth has put both the business community and customers in a new situation. Because of intense competition on the one hand and the customer's freedom to choose among a number of alternatives on the other, the business community has realized the necessity of intelligent marketing strategies and relationship management. Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the Web access logs can help in understanding user behavior and the web structure. From the business and applications point of view, knowledge obtained from web usage patterns can be directly applied to efficiently manage activities related to e-business, e-services and e-education. Accurate web usage information can help to attract new customers, retain current customers, improve cross marketing and sales, improve the effectiveness of promotional campaigns, track departing customers, and so on. The usage information can also be exploited to improve the performance of Web servers by developing proper prefetching and caching strategies so as to decrease the server response time. User profiles could be built by combining users' navigation paths with other data features, such as page viewing time, hyperlink structure, and page content, according to Sonal Tiwari.

3. Clustering to find related customer information

Clustering is a typical unsupervised learning technique for grouping similar data points.
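As a rough illustration of grouping similar data points, the sketch below runs a minimal one-dimensional k-means (a partitioning method) over annual customer spend. It is an assumption-laden example rather than a production algorithm: the spend figures are invented, the helpers `assign` and `kmeans_1d` are hypothetical names, and the starting centres are simply spread across the sorted data so the run is deterministic.

```python
def assign(values, centres):
    """Put each value in the group of its nearest centre."""
    groups = [[] for _ in centres]
    for v in values:
        nearest = min(range(len(centres)), key=lambda j: abs(v - centres[j]))
        groups[nearest].append(v)
    return groups


def kmeans_1d(values, k, iterations=20):
    """Minimal 1-D k-means: alternate assignment and centre update."""
    if k < 2 or k > len(values):
        raise ValueError("k must be between 2 and the number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centres over the sorted data.
    centres = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        groups = assign(values, centres)
        # An empty group keeps its previous centre.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    return centres, assign(values, centres)


# Invented annual spend per customer.
spend = [120, 150, 130, 900, 950, 880, 2400, 2500]
centres, groups = kmeans_1d(spend, k=3)
```

With this data the three groups correspond to low, medium and high spenders; real customer segmentation would normally use several attributes and a distance measure suited to them.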
A clustering algorithm assigns a large number of data points to a smaller number of groups such that data points in the same group share similar properties, while data points in different groups are dissimilar. Clustering has many applications, including part family formation for group technology, image segmentation, information retrieval, web page grouping, market segmentation, and scientific and engineering analysis. Many clustering methods have been proposed, and they can be broadly classified into four categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods.

Customer clustering is one of the most important data mining methodologies used in marketing and customer relationship management (CRM). Customer clustering uses customer-purchase transaction data to track buying behavior and create strategic business initiatives. Companies want to keep high-profit, high-value, and low-risk customers. This cluster typically represents the 10 to 20 percent of customers who create 50 to 80 percent of a company's profits. A company would not want to lose these customers, and the strategic initiative for the segment is obviously retention. A low-profit, high-value, and low-risk customer segment is also an attractive one, and the obvious goal here is to increase profitability for this segment. Cross-selling (selling new products) and up-selling (selling more of what customers currently buy) to this segment are the marketing initiatives of choice.

Assess the reliability of the data mining algorithms. Decide if they can be trusted and predict the errors they are likely to produce.

Most methods for validating a data-mining model do not answer business questions directly, but provide metrics that can be used to guide a business or development decision. There is no comprehensive rule that can tell you when a model is good enough, or when you have enough data.
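One simple way to obtain such metrics, sketched here under assumptions, is to measure the share of correct predictions on several separate test sets and compare the results. The rule `predict_high_value`, the two store test sets and their labels are all invented for the example; only the idea of scoring one model on different data comes from the discussion.

```python
def accuracy(predict, examples):
    """Share of examples whose predicted label matches the true label."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def predict_high_value(features):
    """Hypothetical model: flag customers spending 1000 or more per year."""
    return features["annual_spend"] >= 1000


# Invented test sets from two stores: (features, true label) pairs.
test_sets = {
    "store_a": [
        ({"annual_spend": 1500}, True),
        ({"annual_spend": 200}, False),
        ({"annual_spend": 1200}, True),
        ({"annual_spend": 900}, False),
    ],
    "store_b": [
        ({"annual_spend": 1100}, False),
        ({"annual_spend": 300}, False),
        ({"annual_spend": 2000}, True),
        ({"annual_spend": 950}, True),
    ],
}

scores = {name: accuracy(predict_high_value, rows) for name, rows in test_sets.items()}
# A large gap between test sets suggests the model does not generalize.
spread = max(scores.values()) - min(scores.values())
```

A single high score says little on its own; a model that scores well on one test set and poorly on another is a warning that its predictions depend on the particular data it was checked against.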
Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has been provided. There are various measures of accuracy, but all of them depend on the data that is used. In reality, values might be missing or approximate, or the data might have been changed by multiple processes. Particularly in the phase of exploration and development, you might decide to accept a certain amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a particular store based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong accounting method. Therefore, measurements of accuracy must be balanced by assessments of reliability.

Reliability assesses the way that a data-mining model performs on different data sets. A data-mining model is reliable if it generates the same type of predictions or finds the same general kinds of patterns regardless of the test data that is supplied. For example, the model that you generate for the store that used the wrong accounting method would not generalize well to other stores, and therefore would not be reliable.

Analyze privacy concerns raised by the collection of personal data for mining purposes.

1. Choose and describe three (3) concerns raised by consumers.

Recent surveys on privacy show great concern about the use of personal data for purposes other than the one for which the data was collected. A second concern is misinformation: the handling of incorrect data can cause serious and long-term damage, so individuals should be able to challenge the correctness of data about themselves, such as personal records. The last concern is granular access to personal information, such as information about someone's health being available when that person applies for a job.

2. Decide if each of these concerns is valid and explain your decision for each.
These concerns are valid. The first concern led to an extreme case in 1989: after collecting over $16 million USD by selling the driver-license data of 19. million Californian residents, the Department of Motor Vehicles in California revised its data-selling policy when Robert Bardo used its services to obtain the address of actress Rebecca Schaeffer and later killed her in her apartment. While it is very unlikely that Knowledge Discovery and Data Mining (KDDM) tools will directly reveal precise confidential data, exploratory KDDM tools may correlate or disclose confidential, sensitive facts about individuals, resulting in a significant reduction of possibilities.

The second concern is valid because of an incident in Washington: Cablevision fired an employee, James Russell Wiggings, on the basis of information obtained from Equifax, Atlanta, about Wiggings' conviction for cocaine possession; the information was actually about James Ray Wiggings, and the case ended up in court. This illustrates a serious issue in defining ownership of data containing personal records.

The third concern is valid as well. For example, employers are obliged to perform a background check when hiring a worker, but it is widely accepted that information about diet and exercise habits should not affect hiring decisions.

3. Describe how each concern is being allayed.

KDDM revitalizes some issues and poses new threats to privacy. Some of these can be directly attributed to the fact that this powerful technique may enable the correlation of separate data sets in order to significantly reduce the possible values of private information. Others can be attributed more to the interpretation, application and actions taken from the inferences obtained with the tools.
While this raises concerns, there is a body of knowledge in the field of statistical databases that could potentially be extended and adapted to develop new techniques that balance the right to privacy against the need for knowledge and analysis of large volumes of information. Some of these new privacy protection methods are emerging as the application of KDD tools moves to more controversial datasets.

Provide at least three (3) examples where businesses have used predictive analysis to gain a competitive advantage and evaluate the effectiveness of each business's strategy.

The first advantage is that analysis helps establish the validity of a product by distinguishing between the positioning of a product and its ability to satisfy customer requirements. Other important attributes include ease of use, innovation, and how well the product integrates with other technologies that customers need.

The second advantage is the value the technology provides to customers. Even if a product is well designed, it must be able to help businesses achieve their business goals. Goals range from gaining insight about customers in order to be more competitive to using the technology to increase revenue. A key attribute measured in this dimension is how well the product supports companies in meeting their objectives.

The third advantage is the strength of the company's strategy. It is not enough simply to have a good vision; a company must also have a well-designed road map that can support this vision. Vision attributes also include more tactical aspects of the company's strategy, such as a technology platform that can scale, well-articulated messaging, and positioning. A key component of this dimension is clarity: it must be clear what business problem the company is solving for which customer.
