Delver’s Crawler documentation:
class delver.Crawler(history=True, max_history=5, absolute_links=True)[source]

Browser-mimicking object. Mostly a wrapper around the Requests and Lxml libraries.
Parameters:
- history – (optional) bool, turns history tracking on or off in the Crawler
- max_history – (optional) int, maximum number of items held in history
- absolute_links – (optional) bool, always makes all links absolute
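For instance, a minimal sketch of the constructor options above (the values here are illustrative):

>>> c = Crawler(history=True, max_history=10, absolute_links=False)
>>> c.open('https://httpbin.org/html')
<Response [200]>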
Features:
- To some extent, acts like a browser
- Allows visiting pages, form posting, content scraping, cookie handling, etc.
- Wraps requests.Session()
Simple usage:
>>> c = Crawler()
>>> response = c.open('https://httpbin.org/html')
>>> response.status_code
200
Form submit:
>>> c = Crawler()
>>> response = c.open('https://httpbin.org/forms/post')
>>> forms = c.forms()

Filling up field values:

>>> form = forms[0]
>>> form.fields = {
...     'custname': 'Ruben Rybnik',
...     'custemail': 'ruben.rybnik@fakemail.com',
...     'size': 'medium',
...     'topping': ['bacon', 'cheese'],
...     'custtel': '+48606505888'
... }
>>> submit_result = c.submit(form)
>>> submit_result.status_code
200

Checking whether the form post succeeded:

>>> c.submit_check(
...     form,
...     phrase="Ruben Rybnik",
...     url='https://httpbin.org/post',
...     status_codes=[200])
True
Form file upload:
>>> c = Crawler()
>>> c.open('http://cgi-lib.berkeley.edu/ex/fup.html')
<Response [200]>
>>> forms = c.forms()
>>> upload_form = forms[0]
>>> upload_form.fields = {
...     'note': 'Text file with quote',
...     'upfile': open('test/test_file.txt', 'r')
... }
>>> c.submit(upload_form, action='http://cgi-lib.berkeley.edu/ex/fup.cgi')
<Response [200]>
>>> c.submit_check(
...     upload_form,
...     phrase="road is easy",
...     status_codes=[200]
... )
True
Cookies handling:
>>> c = Crawler()
>>> c.open('https://httpbin.org/cookies', cookies={
...     'cookie_1': '1000101000101010',
...     'cookie_2': 'ABABHDBSBAJSLLWO',
... })
<Response [200]>
Find links:
>>> c = Crawler()
>>> c.open('https://httpbin.org/links/10/0')
<Response [200]>

Links can be filtered by HTML tags and by filters such as id, text, title, and class:

>>> links = c.links(
...     tags = ('style', 'link', 'script', 'a'),
...     filters = {
...         'text': '7'
...     },
...     match='NOT_EQUAL'
... )
>>> len(links)
8
Find images:
>>> c = Crawler()
>>> c.open('https://www.python.org/')
<Response [200]>

First image path containing 'python-logo':

>>> next(
...     image_path for image_path in c.images()
...     if 'python-logo' in image_path
... )
'https://www.python.org/static/img/python-logo.png'
Download file:
>>> import os
>>> c = Crawler()
>>> local_file_path = c.download(
...     local_path='test',
...     url='https://httpbin.org/image/png',
...     name='test.png'
... )
>>> os.path.isfile(local_file_path)
True
Download a list of files in parallel:

>>> c = Crawler()
>>> c.open('https://xkcd.com/')
<Response [200]>
>>> full_images_urls = [c.join_url(src) for src in c.images()]
>>> downloaded_files = c.download_files('test', files=full_images_urls)
>>> len(full_images_urls) == len(downloaded_files)
True
Traversing through history:
>>> c = Crawler()
>>> c.open('http://quotes.toscrape.com/')
<Response [200]>
>>> tags_links = c.links(filters={'class': 'tag'})
>>> c.follow(tags_links[0])
<Response [200]>
>>> c.follow(tags_links[1])
<Response [200]>
>>> c.follow(tags_links[2])
<Response [200]>
>>> history = c.history()
>>> c.back()
>>> c.get_url() == history[-2].url
True
add_customized_kwargs(kwargs)[source]

Adds request keyword arguments customized by setting Crawler attributes like proxy, useragent, headers. Arguments won’t be passed if they are already set as open method kwargs.
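A sketch of the intended flow, assuming the proxy, useragent and headers attributes named above are plain instance attributes:

>>> c = Crawler()
>>> c.useragent = 'delver-docs-example/1.0'  # assumed attribute, per the description above
>>> kwargs = c.add_customized_kwargs({'timeout': 5})  # keys already present are left untouched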
cookies()[source]

Wraps the RequestsCookieJar object from the requests library.

Returns: RequestsCookieJar object
current_parser()[source]

Return the parser associated with the current flow item.

Returns: matched parser object, e.g. an HtmlParser object
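For instance (a sketch; the concrete parser class depends on the response content type):

>>> c = Crawler()
>>> c.open('https://httpbin.org/html')
<Response [200]>
>>> parser = c.current_parser()  # parser for the most recent response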
direct_submit(url=None, data=None)[source]

Direct submit. Used when a quick post to a form is needed or when no forms are found by the parser.
Usage:
>>> data = {'name': 'Piccolo'}
>>> c = Crawler()
>>> result = c.submit(action='https://httpbin.org/post', data=data)
>>> result.status_code
200
Parameters:
- url – submit url (form action url), str
- data – submit parameters, dict

Returns: Response object
download_files(local_path, files=None, workers=10)[source]

Download a list of files in parallel.
Parameters:
- local_path – download path
- files – list of file URLs
- workers – number of threads

Returns: list of downloaded file paths
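For example, mirroring the parallel download from the class overview (a sketch; local file names depend on the URLs):

>>> c = Crawler()
>>> urls = [
...     'https://httpbin.org/image/png',
...     'https://httpbin.org/image/jpeg',
... ]
>>> paths = c.download_files('test', files=urls, workers=2)
>>> len(paths)
2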
fit_parser(response)[source]

Fits a parser according to the response type.

Parameters: response – Response object
Returns: matched parser object, e.g. an HtmlParser object
forms(filters=None)[source]

Return an iterable over forms. Doesn’t find JavaScript forms yet (but will).

example_filters = {
    'id': 'searchbox',
    'name': 'name',
    'action': 'action',
    'has_fields': ['field1', 'field2']
}
Usage:
>>> c = Crawler()
>>> response = c.open('http://cgi-lib.berkeley.edu/ex/fup.html')
>>> forms = c.forms()
>>> forms[0].fields['note'].get('tag')
'input'
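Filters following the example_filters structure above can narrow the result; a sketch continuing from the session above (the 'note' field appears in the usage example):

>>> filtered = c.forms(filters={'has_fields': ['note']})  # only forms that have a 'note' field
>>> filtered[0].fields['note'].get('tag')
'input'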
open(url, method='get', **kwargs)[source]

Opens a url. Wraps the functionality of Session from the Requests library.
Parameters:
- url – url to visit, str
- method – 'get', 'post', etc., str
- kwargs – additional keywords like headers, cookies, etc.

Returns: Response object
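For example, a POST request (a sketch, assuming extra kwargs such as data are passed through to the underlying requests session):

>>> c = Crawler()
>>> response = c.open('https://httpbin.org/post', method='post', data={'q': 'delver'})
>>> response.status_code
200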
request_history()[source]

Returns the current request history (e.g. the list of redirects followed to complete the request).
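For instance, after a redirecting request (a sketch; the exact type of the history items is not assumed here):

>>> c = Crawler()
>>> c.open('https://httpbin.org/redirect/2')  # follows two redirects
<Response [200]>
>>> redirects = c.request_history()  # one entry per hop made for this request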