Contents

Preface

Part I. Building Scrapers

1. Your First Web Scraper
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably and Handling Exceptions

2. Advanced HTML Parsing
    You Don't Always Need a Hammer
    Another Serving of BeautifulSoup
    find() and find_all() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions

3. Writing Web Crawlers
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet

4. Web Crawling Models
    Planning and Defining Objects
    Dealing with Different Website Layouts
    Structuring Crawlers
    Crawling Sites Through Search
    Crawling Sites Through Links
    Crawling Multiple Page Types
    Thinking About Web Crawler Models

5. Scrapy
    Installing Scrapy
    Initializing a New Spider
    Writing a Simple Scraper
    Spidering with Rules
    Creating Items
    Outputting Items
    The Item Pipeline
    Logging with Scrapy
    More Resources

6. Storing Data
    Media Files
    Storing Data to CSV
    MySQL
    Installing MySQL
    Some Basic Commands
    Integrating with Python
    Database Techniques and Good Practice
    "Six Degrees" in MySQL
    Email

Part II. Advanced Scraping

7. Reading Documents
    Document Encoding
    Text
    Text Encoding and the Global Internet
    CSV
    Reading CSV Files
    PDF
    Microsoft Word and .docx

8. Cleaning Your Dirty Data
    Cleaning in Code
    Data Normalization
    Cleaning After the Fact
    OpenRefine

9. Reading and Writing Natural Languages
    Summarizing Data
    Markov Models
    Six Degrees of Wikipedia: Conclusion
    Natural Language Toolkit
    Installation and Setup
    Statistical Analysis with NLTK
    Lexicographical Analysis with NLTK
    Additional Resources

10. Crawling Through Forms and Logins
    Python Requests Library
    Submitting a Basic Form
    Radio Buttons, Checkboxes, and Other Inputs
    Submitting Files and Images
    Handling Logins and Cookies
    HTTP Basic Access Authentication
    Other Form Problems

11. Scraping JavaScript
    A Brief Introduction to JavaScript
    Common JavaScript Libraries
    Ajax and Dynamic HTML
    Executing JavaScript in Python with Selenium
    Additional Selenium Webdrivers
    Handling Redirects
    A Final Note on JavaScript

12. Crawling Through APIs
    A Brief Introduction to APIs
    HTTP Methods and APIs
    More About API Responses
    Parsing JSON
    Undocumented APIs
    Finding Undocumented APIs
    Documenting Undocumented APIs
    Finding and Documenting APIs Automatically
    Combining APIs with Other Data Sources
    More About APIs

13. Image Processing and Text Recognition
    Overview of Libraries
    Pillow
    Tesseract
    NumPy
    Processing Well-Formatted Text
    Adjusting Images Automatically
    Scraping Text from Images on Websites
    Reading CAPTCHAs and Training Tesseract
    Training Tesseract
    Retrieving CAPTCHAs and Submitting Solutions

14. Avoiding Scraping Traps
    A Note on Ethics
    Looking Like a Human
    Adjust Your Headers
    Handling Cookies with JavaScript
    Timing Is Everything
    Common Form Security Features
    Hidden Input Field Values
    Avoiding Honeypots
    The Human Checklist

15. Testing Your Website with Scrapers
    An Introduction to Testing
    What Are Unit Tests?
    Python unittest
    Testing Wikipedia
    Testing with Selenium
    Interacting with the Site
    unittest or Selenium?

16. Web Crawling in Parallel
    Processes versus Threads
    Multithreaded Crawling
    Race Conditions and Queues
    The threading Module
    Multiprocess Crawling
    Multiprocess Crawling
    Communicating Between Processes
    Multiprocess Crawling--Another Approach

17. Scraping Remotely
    Why Use Remote Servers?
    Avoiding IP Address Blocking
    Portability and Extensibility
    Tor
    PySocks
    Remote Hosting
    Running from a Website-Hosting Account
    Running from the Cloud
    Additional Resources

18. The Legalities and Ethics of Web Scraping
    Trademarks, Copyrights, Patents, Oh My!
    Copyright Law
    Trespass to Chattels
    The Computer Fraud and Abuse Act
    robots.txt and Terms of Service
    Three Web Scrapers
    eBay versus Bidder's Edge and Trespass to Chattels
    United States v. Auernheimer and The Computer Fraud and Abuse Act
    Field v. Google: Copyright and robots.txt
    Moving Forward

Index