Mastering Java for Data Science
上QQ阅读APP看书,第一时间看更新

Natural Language Processing

Processing natural language texts is very complex, they are not very well structured and require a lot of cleaning and normalizing. Yet the amount of textual information around us is tremendous: a lot of text data is generated every minute, and it is very hard to retrieve useful information from them. Using data science and machine learning is very helpful for text problems as well; they allow us to find the right text, process it, and extract the valuable bits of information.

There are multiple ways we can use the text information. One example is information retrieval, or, simply, text search--given a user query and a collection of documents, we want to find what are the most relevant documents in the corpus with respect to the query, and present them to the user. Other applications include sentiment analysis--predicting whether a product review is positive, neutral or negative, or grouping the reviews according to how they talk about the products. 

We will talk more about information retrieval, Natural Language Processing (NLP) and working with texts in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval. Additionally, we will see how to process large amounts of text data in Chapter 9Scaling Data Science.  

The methods we can use for machine learning and data science are very important. What is equally important is the the way we create them and then put them to use in production systems. Data science process models help us make it more organized and systematic, which is why we will talk about them next.