Business Intelligence, 2e (Turban/Sharda/Delen/King)
Chapter 5 Text and Web Mining

1) DARPA and MITRE teamed up to develop capabilities to automatically filter text-based information sources to generate actionable information in a timely manner.
Answer: TRUE
Diff: 2 Page Ref: 190

2) A vast majority of business data is captured and stored in text documents that are structured.
Answer: FALSE
Diff: 2 Page Ref: 192

3) Text mining is important to competitive advantage because knowledge is power, and knowledge is derived from text data sources.
Answer: TRUE
Diff: 2 Page Ref: 192

4) The purpose and processes of text mining are different from those of data mining because with text mining the inputs to the process are data files such as Word documents, PDF files, text excerpts, and XML files.
Answer: FALSE
Diff: 3 Page Ref: 192

5) The benefits of text mining are greatest in areas where very large amounts of textual data are being generated, such as law, academic research, finance, and medicine.
Answer: TRUE
Diff: 2 Page Ref: 192

6) Unstructured data has a predetermined format. It is usually organized into records as categorical, ordinal, and continuous variables and stored in databases.
Answer: FALSE
Diff: 2 Page Ref: 193

7) Stemming is the process of reducing inflected words to their base or root form.
Answer: TRUE
Diff: 1 Page Ref: 193

8) Stop words, such as a, am, the, and was, are words that are filtered out prior to or after processing of natural language data.
Answer: TRUE
Diff: 2 Page Ref: 193

9) The goal of natural language processing (NLP) is syntax-driven text manipulation.
Answer: FALSE
Diff: 2 Page Ref: 196

10) Two advantages associated with the implementation of NLP are word sense disambiguation and syntactic ambiguity.
Answer: FALSE
Diff: 2 Page Ref: 196

11) By applying a learning algorithm to parsed text, researchers from Stanford University’s NLP lab have developed methods that can automatically identify the concepts and relationships between those concepts in the text.
Answer: TRUE
Diff: 2 Page Ref: 197

12) Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers.
Answer: TRUE
Diff: 1 Page Ref: 200

13) Compared to polygraphs for deception-detection, text-based deception detection has the advantages of being nonintrusive and widely applicable to textual data and transcriptions of voice recordings.
Answer: TRUE
Diff: 2 Page Ref: 201

14) The main purpose of establishing the corpus is to collect all of the documents related to the context being studied.
Answer: TRUE
Diff: 2 Page Ref: 207

15) The main categories of knowledge extraction methods are recall, search, and signaling.
Answer: FALSE
Diff: 2 Page Ref: 210

16) Web pages consisting of unstructured textual data coded in HTML and logs of visitors’ interactions provide rich data that can easily provide effective and efficient knowledge discovery.
Answer: FALSE
Diff: 3 Page Ref: 217

17) Web crawlers are Web content mining tools that are used to read through the content of a Web site automatically.
Answer: FALSE
Diff: 1 Page Ref: 218

18) Amazon.com leverages Web usage history dynamically and recognizes the user by reading a cookie written by a Web site on the visitor’s computer.
Answer: TRUE
Diff: 1 Page Ref: 221

19) The quality of search results is impossible to measure accurately using strictly quantitative measures such as click-through rate, abandonment, and search frequency. Additional quantitative and qualitative measures are required.
Answer: TRUE
Diff: 2 Page Ref: 222

20) Customer experience management applications gather and report direct feedback from site visitors by benchmarking against other sites and offline channels, and by supporting predictive modeling of future visitor behavior.
Answer: FALSE
Diff: 3 Page Ref: 224

21) A vast majority of business data are stored in text documents that are ________.
A) mostly quantitative
B) virtually unstructured
C) semi-structured
D) highly structured
Answer: B
Diff: 1 Page Ref: 192

22) Text mining is the semi-automated process of extracting ________ from large amounts of unstructured data sources.
A) patterns
B) useful information
C) knowledge
D) all of the above
Answer: D
Diff: 2 Page Ref: 192

23) All of the following are popular application areas of text mining except:
A) information extraction
B) document summarization
C) question answering
D) data structuring
Answer: D
Diff: 2 Page Ref: 193

24) Which of the following correctly defines a text mining term?
A) Tagging is the number of times a word is found in a specific document.
B) A token is an uncategorized block of text in a sentence.
C) Rooting is the process of reducing inflected words to their base form.
D) A term is a single word or multiword phrase extracted directly from the corpus by means of NLP methods.
Answer: D
Diff: 3 Page Ref: 194

25) ________ is a branch of the field of linguistics and a part of natural language processing that studies the internal structure of words.
A) Morphology
B) Corpus
C) Stemming
D) Polysemes
Answer: A
Diff: 2 Page Ref: 194

26) Using ________ as a rich source of knowledge and a strategic weapon, Kodak not only survives but excels in its market segment defined by innovation and constant change.
A) visualization
B) deception detection
C) patent analysis
D) semantic cues
Answer: C
Diff: 2 Page Ref: 194

27) It has been shown that the bag-of-words method may not produce good enough information content for text mining tasks. More advanced techniques such as ________ are needed.
A) classification
B) natural language processing
C) evidence-based processing
D) symbolic processing
Answer: B
Diff: 2 Page Ref: 195

28) Why will computers probably not be able to understand natural language the same way and with the same accuracy that humans do?
A) A true understanding of meaning requires extensive knowledge of a topic beyond what is in the words, sentences, and paragraphs.
B) The natural human language is too specific.
C) The part of speech depends only on the definition and not on the context within which it is used.
D) All of the above.
Answer: A
Diff: 3 Page Ref: 196

29) At a very high level, the text mining process consists of each of the following tasks except:
A) create log frequencies
B) establish the corpus
C) create the term-document matrix
D) extract the knowledge
Answer: A
Diff: 2 Page Ref: 207

30) In ________, the problem is to group an unlabelled collection of objects, such as documents, customer comments, and Web pages into meaningful groups without any prior knowledge.
A) search recall
B) classification
C) clustering
D) grouping
Answer: C
Diff: 2 Page Ref: 211

31) The two main approaches to text classification are ________ and ________.
A) knowledge engineering; machine learning
B) categorization; clustering
C) association; trend analysis
D) knowledge extraction; association
Answer: A
Diff: 2 Page Ref: 211

32) Commercial software tools include all of the following except:
A) GATE
B) IBM Intelligent Miner Data Mining Suite
C) SAS Text Miner
D) SPSS Text Mining
Answer: A
Diff: 2 Page Ref: 216

33) Why does the Web pose great challenges for effective and efficient knowledge discovery?
A) The Web search engines are indexed-based.
B) The Web is too dynamic.
C) The Web is too specific to a domain.
D) The Web infrastructure contains hyperlink information.
Answer: B
Diff: 2 Page Ref: 217

34) A simple keyword-based search engine suffers from several deficiencies, which include all of the following except:
A) a topic of any breadth can easily contain hundreds or thousands of documents
B) many documents that are highly relevant to a topic may not contain the exact keywords defining them
C) web mining can identify authoritative Web pages
D) many of the search results are marginally or not relevant to the topic
Answer: C
Diff: 3 Page Ref: 217

35) Which of the following is not one of the three main areas of Web mining?
A) Web search mining
B) Web content mining
C) Web structure mining
D) Web usage mining
Answer: A
Diff: 2 Page Ref: 218

36) Which of the following refers to developing useful information from the links included in the Web documents?
A) Web content mining
B) Web subject mining
C) Web structure mining
D) Web matter mining
Answer: C
Diff: 2 Page Ref: 219

37) A ________ is one or more Web pages that provide a collection of links to authoritative pages, reference sites, or a resource list on a specific topic.
A) hub
B) hyperlink-induced topic search
C) spoke
D) community
Answer: A
Diff: 2 Page Ref: 219

38) All of the following are types of data generated through Web page visits except:
A) data stored in server access logs, referrer logs, agent logs, and client-side cookies
B) user profiles
C) hyperlink analysis
D) metadata, such as page attributes, content attributes, and usage data
Answer: C
Diff: 2 Page Ref: 220

39) When registered users revisit Amazon.com, they are greeted by name. This task involves recognizing the user by ________.
A) pattern discovery
B) association
C) text mining
D) reading a cookie
Answer: D
Diff: 1 Page Ref: 221

40) Forward-thinking companies like Ask.com, Scholastic, and St. John Health System are actively using Web mining systems to answer important questions of “Who?” “Why?” and “How?” The benefits of integrating these systems:
A) are measured qualitatively in terms of customer satisfaction, but not measured using financial or other quantitative measures.
B) can be significant in terms of incremental financial growth and increasing customer loyalty and satisfaction.
C) have not yet outweighed the costs of the Web mining systems and analysis.
D) can be infinitely measurable.
Answer: B
Diff: 3 Page Ref: 222

41) ________ is the semi-automated process of extracting patterns from large amounts of unstructured data sources.
Answer: Text mining
Diff: 1 Page Ref: 192

42) ________ is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases, where the data are organized in records structured by categorical, ordinal, or continuous variables.
Answer: Data mining
Diff: 1 Page Ref: 192

43) ________ is the grouping of similar documents without having a predefined set of categories.
Answer: Clustering
Diff: 2 Page Ref: 193

44) In linguistics, a(n) ________ is a large and structured set of texts prepared for the purpose of conducting knowledge discovery.
Answer: corpus
Diff: 1 Page Ref: 193

45) ________ is the process of reducing inflected words to their base or root form.
Answer: Stemming
Diff: 1 Page Ref: 193

46) ________ words or noise words are words that are filtered out prior to or after processing of natural language data.
Answer: Stop
Diff: 1 Page Ref: 193

47) The term “stop-words” is used in text mining to ________ commonly used words.
Answer: eliminate
Diff: 2 Page Ref: 193

48) ________ is an important component of text mining and is a subfield of artificial intelligence and computational linguistics. It studies the problem of understanding the natural human language.
Answer: Natural language processing (NLP)
Diff: 1 Page Ref: 196

49) ________ analysis is a technique used to detect favorable and unfavorable opinions toward specific products and services using textual data sources, such as customer feedback in Web postings and the detection of unfavorable rumors.
Answer: Sentiment
Diff: 2 Page Ref: 197

50) At a very high level, the first of three consecutive tasks in the text mining process is to establish the ________, which is a list of organized documents.
Answer: corpus
Diff: 1 Page Ref: 207

51) In the text mining process, the output of task two is a flat file called a ________ matrix where the cells are populated with the term frequencies.
Answer: term-document
Diff: 3 Page Ref: 207

52) One of the main approaches to text classification is ________ in which an expert’s knowledge is encoded into the system either declaratively or in the form of procedural classification rules.
Answer: knowledge engineering
Diff: 2 Page Ref: 211

53) A(n) ________ is one or more Web pages that provide a collection of links to authoritative pages.
Answer: hub
Diff: 1 Page Ref: 219

54) ________ mining is the process of extracting useful information from the links embedded in Web documents.
Answer: Web structure
Diff: 2 Page Ref: 219

55) ________ mining is the extraction of useful information from data generated through Web page visits and transactions.
Answer: Web usage
Diff: 2 Page Ref: 220

56) Analysis of the information collected by Web servers can help better understand user behavior.
Analysis of this data is called ________ analysis.
Answer: clickstream
Diff: 2 Page Ref: 220

57) ________ applications focus on “who and how” questions by gathering and reporting direct feedback from site visitors, by benchmarking against other sites and offline channels, and by supporting predictive modeling of future visitor behavior.
Answer: Voice of Customer
Diff: 2 Page Ref: 224

58) Web analytics, CEM, and VOC applications form the foundation of the Web site ________ ecosystem that supports the online business’ ability to positively influence desired outcomes.
Answer: optimization
Diff: 2 Page Ref: 224

59) The ________ model, which is one where multiple sources of data describing the same population are integrated to increase the depth and richness of the resulting analysis, forms the framework of the Web site optimization ecosystem.
Answer: convergent validation
Diff: 3 Page Ref: 225

60) Fundamental to the optimization process is ________, gathering data and information that can then be transformed into tangible analysis and recommendations for improvement using Web mining tools and techniques.
Answer: measurement
Diff: 3 Page Ref: 225

61) Compare and contrast text mining and data mining.
Answer: Text mining is the semi-automated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources. Data mining is the process of identifying valid, novel, potentially useful, and understandable patterns in data stored in structured databases, where the data are organized in records structured by categorical, ordinal, or continuous variables. Text mining is the same as data mining in that it has the same purpose and uses the same processes, but with text mining the input to the process is a collection of unstructured data files such as Word documents, PDF files, and so on.
Diff: 2 Page Ref: 192

62) Why will computers probably not be able to understand natural language the same way and with the same accuracy that humans do?
Answer: Natural human language is too vague for computers to understand fully, and a true understanding of meaning requires extensive knowledge of a topic beyond what is in the words, sentences, and paragraphs.
Diff: 1 Page Ref: 196

63) NLP has successfully been applied to a variety of tasks via computer programs to automatically process natural human language that previously could only be done by humans.
List three of the most popular of these tasks.
Answer: Any three of the following:
• Information retrieval. The science of searching for relevant documents, finding specific information within them, and generating metadata as to their contents.
• Information extraction. A type of information retrieval whose goal is to automatically extract structured information from a certain domain, using machine-readable documents.
• Question answering. The task of automatically answering a question posed in natural language; that is, producing a human-language answer when given a human-language question.
• Automatic summarization. The creation of a shortened version of a text document by a computer program that contains the most important points of the document.
• Natural language generation. Systems convert information from computer databases into readable human language.
• Natural language understanding. Systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
• Machine translation. The automatic translation of one human language to another.
• Foreign language reading. A computer program that assists a nonnative language speaker to read a foreign language.
• Foreign language writing. A computer program that assists a nonnative language user in writing in a foreign language.
• Speech recognition. Converts spoken words to machine-readable input.
• Text-to-speech. A computer program converts normal language text into human speech.
• Text proofing. A computer program reads a proof copy of a text in order to detect and correct any errors.
• Optical character recognition. The automatic translation of images of handwritten, typewritten, or printed text into machine-editable text.
Diff: 2 Page Ref: 199

64) Describe a marketing application of text mining.
Answer: Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Text generated by call-center notes as well as transcriptions of voice conversations with customers can be analyzed by text mining algorithms to extract novel, actionable information about customers’ perceptions toward a company’s products and services. Text mining is valuable for customer relationship management (CRM).
Companies can use text mining to analyze unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior.
Diff: 2 Page Ref: 200

65) What is the primary purpose of text mining within the context of knowledge discovery?
Answer: The primary purpose of text mining within the context of knowledge discovery is to process unstructured (textual) data along with structured data, if relevant to the problem, to extract meaningful and actionable patterns for better decision making.
Diff: 1 Page Ref: 206

66) Diagram and explain the three-step text mining process.
Answer: See Figure 5.5 in the textbook.
Diff: 2 Page Ref: 207

67) List two options for managing or reducing the dimensionality (size) of the term-document matrix (TDM).
Answer:
• A domain expert goes through the list of terms and eliminates those that do not make much sense for the context of the study.
• Eliminate terms with very few occurrences in very few documents.
• Transform the matrix using singular value decomposition.
Diff: 3 Page Ref: 210

68) What are three of the challenges for effective and efficient knowledge discovery posed by the Web?
Answer:
The Web is too big for effective data mining. Because of the sheer size of the Web, it is not feasible to set up a data warehouse to replicate, store, and integrate all of the data on the Web, making data collection and integration a challenge.
The Web is too complex. The complexity of a Web page is far greater than a page in a traditional text document collection. Web pages lack a unified structure.
The Web is too dynamic. The Web is a highly dynamic information source. Not only does the Web grow rapidly, but its content is constantly being updated.
The Web is not specific to a domain. The Web serves a broad diversity of communities and connects billions of workstations. Web users have very different backgrounds, interests, and usage purposes.
The Web has everything. Only a small portion of the information on the Web is truly relevant or useful to someone or some task.
Diff: 2 Page Ref: 217

69) Define the three main areas of Web mining and each area’s source of information.
Answer:
Web content mining refers to the extraction of useful information from Web pages. Source: unstructured textual content of the Web pages, usually in HTML format.
Web structure mining is the process of extracting useful information from the links embedded in Web documents. Source: the URL links contained in the Web pages.
Web usage mining is the extraction of useful information from data generated through Web page visits and transactions.
Source: the detailed description of a Web site’s visits.
Diff: 2 Page Ref: 218

70) List three business applications of Web mining.
Answer:
1. Determine the lifetime value of clients.
2. Design cross-marketing strategies across products.
3. Evaluate promotional campaigns.
4. Target electronic ads and coupons at user groups based on user access patterns.
5. Predict user behavior based on previously learned rules and users’ profiles.
6. Present dynamic information to users based on their interests and profiles.
Diff: 2 Page Ref: 221
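
Study aid for questions 8, 51, and 67: the sketch below is an illustration only, not the textbook's procedure. It filters stop words, builds a term-document matrix of raw term frequencies, and then applies one of the reduction options from question 67 (dropping terms that occur in very few documents). The stop-word list, the sample corpus, the function names, the `min_docs` threshold, and the choice of one row per document are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical stop-word list; question 8 names only a, am, the, and was.
STOP_WORDS = {"a", "am", "the", "was", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def term_document_matrix(corpus):
    """Return (terms, matrix) with one row per document and one column
    per term; each cell holds the term's raw frequency in that document."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

def prune_rare_terms(terms, matrix, min_docs=2):
    """Keep only the terms that appear in at least min_docs documents."""
    keep = [j for j in range(len(terms))
            if sum(1 for row in matrix if row[j] > 0) >= min_docs]
    return [terms[j] for j in keep], [[row[j] for j in keep] for row in matrix]

# Made-up three-document corpus for demonstration.
corpus = [
    "The customer was happy with the product.",
    "The product was late and the customer complained.",
    "Support resolved the complaint about the late delivery.",
]
terms, tdm = term_document_matrix(corpus)
kept_terms, reduced = prune_rare_terms(terms, tdm, min_docs=2)
print(len(terms), "terms before pruning")
print(kept_terms)   # ['customer', 'late', 'product']
print(reduced)      # [[1, 0, 1], [1, 1, 1], [0, 1, 0]]
```

Because no stemming is applied here, "complained" and "complaint" remain separate terms and both are pruned; reducing them to a shared root first (question 7) would let them count as one term.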