Preface

Part I. Building Scrapers

1. Your First Web Scraper
    Connecting
    An Introduction to BeautifulSoup
    Installing BeautifulSoup
    Running BeautifulSoup
    Connecting Reliably and Handling Exceptions

2. Advanced HTML Parsing
    You Don't Always Need a Hammer
    Another Serving of BeautifulSoup
    find() and find_all() with BeautifulSoup
    Other BeautifulSoup Objects
    Navigating Trees
    Regular Expressions
    Regular Expressions and BeautifulSoup
    Accessing Attributes
    Lambda Expressions

3. Writing Web Crawlers
    Traversing a Single Domain
    Crawling an Entire Site
    Collecting Data Across an Entire Site
    Crawling Across the Internet

4. Web Crawling Models
    Planning and Defining Objects
    Dealing with Different Website Layouts
    Structuring Crawlers
    Crawling Sites Through Search
    Crawling Sites Through Links
    Crawling Multiple Page Types
    Thinking About Web Crawler Models

5. Scrapy
    Installing Scrapy
    Initializing a New Spider
    Writing a Simple Scraper
    Spidering with Rules
    Creating Items
    Outputting Items
    The Item Pipeline
    Logging with Scrapy
    More Resources

6. Storing Data
    Media Files
    Storing Data to CSV
    MySQL
    Installing MySQL
    Some Basic Commands
    Integrating with Python
    Database Techniques and Good Practice
    "Six Degrees" in MySQL
    Email

Part II. Advanced Scraping

7. Reading Documents
    Document Encoding
    Text
    Text Encoding and the Global Internet
    CSV
    Reading CSV Files
    PDF
    Microsoft Word and .docx

8. Cleaning Your Dirty Data
    Cleaning in Code
    Data Normalization
    Cleaning After the Fact
    OpenRefine

9. Reading and Writing Natural Languages
    Summarizing Data
    Markov Models
    Six Degrees of Wikipedia: Conclusion
    Natural Language Toolkit
    Installation and Setup
    Statistical Analysis with NLTK
    Lexicographical Analysis with NLTK
    Additional Resources

10. Crawling Through Forms and Logins
    Python Requests Library
    Submitting a Basic Form
    Radio Buttons, Checkboxes, and Other Inputs
    Submitting Files and Images
    Handling Logins and Cookies
    Basic Access Authentication
    Other Form Problems

11. Scraping JavaScript
    A Brief Introduction to JavaScript
    Common JavaScript Libraries
    Ajax and Dynamic HTML
    Executing JavaScript in Python with Selenium
    Additional Selenium WebDrivers
    Handling Redirects
    A Final Note on JavaScript

12. Crawling Through APIs
    A Brief Introduction to APIs
    Methods and APIs
    More About API Responses
    Parsing JSON
    Undocumented APIs
    Finding Undocumented APIs
    Documenting Undocumented APIs
    Finding and Documenting APIs Automatically
    Combining APIs with Other Data Sources
    More About APIs

13. Image Processing and Text Recognition
    Overview of Libraries
    Pillow
    Tesseract
    NumPy
    Processing Well-Formatted Text
    Adjusting Images Automatically
    Scraping Text from Images on Websites
    Reading CAPTCHAs and Training Tesseract
    Training Tesseract
    Retrieving CAPTCHAs and Submitting Solutions

14. Avoiding Scraping Traps
    A Note on Ethics
    Looking Like a Human
    Adjust Your Headers
    Handling Cookies with JavaScript
    Timing Is Everything
    Common Form Security Features
    Hidden Input Field Values
    Avoiding Honeypots
    The Human Checklist

15. Testing Your Website with Scrapers
    An Introduction to Testing
    What Are Unit Tests?
    Python unittest
    Testing Wikipedia
    Testing with Selenium
    Interacting with the Site
    unittest or Selenium?

16. Web Crawling in Parallel
    Processes Versus Threads
    Multithreaded Crawling
    Race Conditions and Queues
    The threading Module
    Multiprocess Crawling
    Multiprocess Crawling
    Communicating Between Processes
    Multiprocess Crawling: Another Approach

17. Scraping Remotely
    Why Use Remote Servers?
    Avoiding IP Address Blocking
    Portability and Extensibility
    Tor
    PySocks
    Remote Hosting
    Running from a Website-Hosting Account
    Running from the Cloud
    Additional Resources

18. The Legalities and Ethics of Web Scraping
    Trademarks, Copyrights, Patents, Oh My!
    Copyright Law
    Trespass to Chattels
    The Computer Fraud and Abuse Act
    robots.txt and Terms of Service
    Three Web Scrapers
    eBay versus Bidder's Edge and Trespass to Chattels
    United States v. Auernheimer and the Computer Fraud and Abuse Act
    Field v. Google: Copyright and robots.txt
    Moving Forward

Index