api.rst

.. _api:

Developer Interfaces
====================

.. module:: charset_normalizer

Main Interfaces
---------------

Those functions are publicly exposed and are protected through our BC guarantee.

.. autofunction:: from_bytes
.. autofunction:: from_fp
.. autofunction:: from_path
.. autofunction:: is_binary

.. autoclass:: charset_normalizer.models.CharsetMatches
   :inherited-members:

.. autoclass:: charset_normalizer.models.CharsetMatch
   :inherited-members:

.. autofunction:: detect

.. autofunction:: charset_normalizer.utils.set_logging_handler

Mess Detector
-------------

.. autofunction:: charset_normalizer.md.mess_ratio

This library allows you to extend the capabilities of the mess detector by extending the class `MessDetectorPlugin`.

.. autoclass:: charset_normalizer.md.MessDetectorPlugin
   :inherited-members:

.. autofunction:: charset_normalizer.md.is_suspiciously_successive_range
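
As an illustration of the plugin interface described above, here is a minimal, hypothetical plugin. The class name and its heuristic are invented for this example; only the ``eligible``, ``feed``, ``reset`` and ``ratio`` members mirror the documented base class, and the way ``mess_ratio`` discovers plugins is not covered here. ::

    from charset_normalizer.md import MessDetectorPlugin


    class ExcessiveDigitPlugin(MessDetectorPlugin):
        """Toy plugin: flag content that is almost exclusively digits."""

        def __init__(self) -> None:
            self._character_count = 0
            self._digit_count = 0

        def eligible(self, character: str) -> bool:
            # Inspect every printable character we are fed.
            return character.isprintable()

        def feed(self, character: str) -> None:
            self._character_count += 1
            if character.isdigit():
                self._digit_count += 1

        def reset(self) -> None:
            self._character_count = 0
            self._digit_count = 0

        @property
        def ratio(self) -> float:
            # 0.0 means "looks fine"; values close to 1.0 mean "very messy".
            if self._character_count < 8:
                return 0.0
            return 0.8 if self._digit_count / self._character_count > 0.9 else 0.0
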
Coherence Detector
------------------

.. autofunction:: charset_normalizer.cd.coherence_ratio

Utilities
---------

Some reusable functions used across the project. We do not guarantee BC in this area.

.. autofunction:: charset_normalizer.utils.is_accentuated
.. autofunction:: charset_normalizer.utils.remove_accent
.. autofunction:: charset_normalizer.utils.unicode_range
.. autofunction:: charset_normalizer.utils.is_latin
.. autofunction:: charset_normalizer.utils.is_punctuation
.. autofunction:: charset_normalizer.utils.is_symbol
.. autofunction:: charset_normalizer.utils.is_emoticon
.. autofunction:: charset_normalizer.utils.is_separator
.. autofunction:: charset_normalizer.utils.is_case_variable
.. autofunction:: charset_normalizer.utils.is_cjk
.. autofunction:: charset_normalizer.utils.is_hiragana
.. autofunction:: charset_normalizer.utils.is_katakana
.. autofunction:: charset_normalizer.utils.is_hangul
.. autofunction:: charset_normalizer.utils.is_thai
.. autofunction:: charset_normalizer.utils.is_unicode_range_secondary
.. autofunction:: charset_normalizer.utils.any_specified_encoding
.. autofunction:: charset_normalizer.utils.is_multi_byte_encoding
.. autofunction:: charset_normalizer.utils.identify_sig_or_bom
.. autofunction:: charset_normalizer.utils.should_strip_sig_or_bom
.. autofunction:: charset_normalizer.utils.iana_name
.. autofunction:: charset_normalizer.utils.range_scan
.. autofunction:: charset_normalizer.utils.is_cp_similar

.. class:: os.PathLike

.. class:: typing.BinaryIO

index.rst

===================
Charset Normalizer
===================

Overview
========

A library that helps you read text from an unknown charset encoding. This project is motivated by chardet; I am trying to resolve the issue by taking another approach. All IANA character set names for which the Python core library provides codecs are supported. It aims to be as generic as possible.

.. image:: https://repository-images.githubusercontent.com/200259335/d3da9600-dedc-11e9-83e8-081f597505df
   :width: 500px
   :alt: CLI Charset Normalizer
   :align: right

It is released under the MIT license; see LICENSE for more details. Be aware that no warranty of any kind is provided with this package.

Copyright (C) 2023 Ahmed TAHRI <ahmed(dot)tahri(at)cloudnursery.dev>

Introduction
============

This library aims to assist you in finding the encoding that best suits your content. It **DOES NOT** try to uncover the originating encoding; in fact, this program does not care about it. By originating we mean the encoding that was actually used to encode the text file in the first place. Precisely ::

    my_byte_str = 'Bonjour, je suis à la recherche d\'une aide sur les étoiles'.encode('cp1252')

We **ARE NOT** looking for cp1252 **BUT FOR** ``Bonjour, je suis à la recherche d'une aide sur les étoiles``. Because of this ::

    my_byte_str.decode('cp1252') == my_byte_str.decode('cp1256') == my_byte_str.decode('cp1258') == my_byte_str.decode('iso8859_14')
    # Prints True!

There is no wrong answer when decoding ``my_byte_str``: all of them yield the exact same result. This is where this library differs from others; there is no specific probe per encoding table.

Features
========

Each of these entry points is sketched in the short example right after this list.

- Encoding detection from a fp (file pointer), bytes or a PathLike.
- Transpose any encoded content to Unicode, the best we can.
- Detect the spoken language used in the text.
- Ships with a great CLI.
- Also detects binaries.
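
A minimal sketch of those entry points follows; the file path and byte payloads are illustrative only, and ``best()`` may return ``None`` when nothing fits. ::

    from charset_normalizer import from_bytes, from_path, is_binary

    # Detection from raw bytes.
    best_guess = from_bytes('Héllo, wörld.'.encode('cp1252')).best()
    if best_guess is not None:
        print(best_guess.encoding, best_guess.language)

    # Detection from a file path (the path is a placeholder).
    print(str(from_path('./data/sample.1.fr.srt').best()))

    # Binary (non-text) detection.
    print(is_binary(b'\x00\x01\x02\xff'))
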
Start Guide
-----------

.. toctree::
   :maxdepth: 2

   user/support
   user/getstarted
   user/advanced_search
   user/handling_result
   user/miscellaneous
   user/cli

Community Guide
---------------

.. toctree::
   :maxdepth: 2

   community/speedup
   community/faq
   community/why_migrate
   community/featured

Developer Guide
---------------

.. toctree::
   :maxdepth: 3

   api

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

requirements.txt

Sphinx
furo

conf.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# charset-normalizer documentation build configuration file, created by
# sphinx-quickstart on Fri Jun 16 04:30:35 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
import sys
import os

sys.path.insert(0, os.path.abspath(".."))

import charset_normalizer

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
    'sphinx.ext.intersphinx',
    'sphinx.ext.todo',
    'sphinx.ext.coverage',
    'sphinx.ext.mathjax',
    'sphinx.ext.ifconfig',
    'sphinx.ext.viewcode',
    'sphinx.ext.githubpages',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffixes as a list of strings:
#
# source_suffix = ['.rst', '.md']
# source_suffix = '.rst'
source_parsers = {}
source_suffix = ['.rst', ]

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'charset_normalizer'
copyright = '2023, Ahmed TAHRI'
author = 'Ahmed TAHRI'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = charset_normalizer.__version__
# The full version, including alpha/beta/rc tags.
release = charset_normalizer.__version__

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# These patterns also affect html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
#
html_theme = 'furo'
html_theme_path = []

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = []

# -- Options for HTMLHelp output ------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'charset-normalizer-doc'

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'charset-normalizer.tex', 'Charset Normalizer Documentation',
     'Ahmed TAHRI', 'manual'),
]

# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'charset-normalizer', 'Charset Normalizer Documentation',
     [author], 1)
]

# -- Options for Texinfo output -------------------------------------------

texinfo_documents = [
    (master_doc, 'Charset Normalizer', 'Charset Normalizer Documentation',
     author, 'charset-normalizer',
     '🔎 Like Chardet. 🚀 Package for encoding & language detection. Charset detection.',
     'Miscellaneous'),
]

user/advanced_search.rst

Advanced Search
===============

The Charset Normalizer methods ``from_bytes``, ``from_fp`` and ``from_path`` accept some optional parameters that can be tweaked, as follows ::

    from charset_normalizer import from_bytes

    my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')

    results = from_bytes(
        my_byte_str,
        steps=10,                   # Number of steps/blocks to extract from my_byte_str
        chunk_size=512,             # Set the block size of each extraction
        threshold=0.2,              # Maximum amount of chaos allowed on the first pass
        cp_isolation=None,          # Finite list of encodings to use when searching for a match
        cp_exclusion=None,          # Finite list of encodings to avoid when searching for a match
        preemptive_behaviour=True,  # Determine if we should look into my_byte_str (ASCII mode) for a pre-defined encoding
        explain=False,              # Print on screen what is happening when searching for a match
        language_threshold=0.1      # Minimum coherence ratio / language ratio accepted for a match
    )
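
For instance, narrowing the search to a few code pages with ``cp_isolation`` and turning on ``explain`` looks like this; the payload and the code-page choices are arbitrary examples. ::

    from charset_normalizer import from_bytes

    payload = 'Bсеки човек има право на образование.'.encode('cp1251')

    # Only consider a handful of Cyrillic-capable code pages and
    # print the detector's reasoning on stdout.
    results = from_bytes(
        payload,
        cp_isolation=['cp1251', 'koi8_r', 'utf_8'],
        explain=True,
    )

    best_guess = results.best()
    print(best_guess.encoding if best_guess is not None else 'no match')
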
Using CharsetMatches
--------------------

Here, ``results`` is a ``CharsetMatches`` object. It behaves like a list but does not implement all of the related methods. It is sorted from the start, so calling ``best()`` is enough to extract the most probable result.

.. autoclass:: charset_normalizer.CharsetMatches
   :members:

List behaviour
--------------

As said earlier, a ``CharsetMatches`` object behaves like a list. ::

    # Calling len() on results also works
    if not results:
        print('No match for your sequence')

    # Iterate over results like a list
    for match in results:
        print(match.encoding, 'can decode properly your sequence using', match.alphabets, 'and language', match.language)

    # Access results by index
    if results:
        print(str(results[0]))

Using best()
------------

As said above, a ``CharsetMatches`` object behaves like a list and is sorted by default after getting results from ``from_bytes``, ``from_fp`` or ``from_path``. Calling ``best()`` returns the most probable result, the first entry of the list (i.e. index 0). It returns a ``CharsetMatch`` object, or ``None`` if there are no results inside it. ::

    result = results.best()

Calling first()
---------------

The very same thing as calling the method ``best()``.

Class aliases
-------------

``CharsetMatches`` is also known as ``CharsetDetector``, ``CharsetDoctor`` and ``CharsetNormalizerMatches``. This is useful if you prefer a shorter class name.

Verbose output
--------------

You may want to understand why a specific encoding was not picked by charset_normalizer. All you have to do is pass ``explain=True`` to the methods ``from_bytes``, ``from_fp`` or ``from_path``.

user/handling_result.rst

================
Handling Result
================

When initiating a search on a buffer, bytes or file, you can assign the return value and fully exploit it. ::

    my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')

    # Assign the return value so we can fully exploit the result
    result = from_bytes(
        my_byte_str
    ).best()

    print(result.encoding)  # cp1251

Using CharsetMatch
------------------

Here, ``result`` is a ``CharsetMatch`` object or ``None``.

.. autoclass:: charset_normalizer.CharsetMatch
   :members:
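
Beyond ``encoding``, a ``CharsetMatch`` exposes a few other handy attributes. The sketch below only uses attributes documented on the class; the values shown in comments are indicative, not guaranteed. ::

    from charset_normalizer import from_bytes

    result = from_bytes(
        'Bсеки човек има право на образование.'.encode('cp1251')
    ).best()

    if result is not None:
        print(result.encoding)               # e.g. 'cp1251'
        print(result.language)               # e.g. 'Bulgarian'
        print(result.alphabets)              # Unicode ranges seen in the decoded text
        print(result.could_be_from_charset)  # other encodings yielding the same output
        print(str(result))                   # the decoded (Unicode) content itself
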
user/getstarted.rst

Installation
============

This installs a package that can be used from Python (``import charset_normalizer``). To install it for all users on the system, administrator rights (root) may be required.

Using PIP
---------

Charset Normalizer can be installed from pip ::

    pip install charset-normalizer

You may retrieve the latest unicodedata backport as follows ::

    pip install charset-normalizer[unicode_backport]

From git via master
-------------------

You can install from the dev-master branch using git ::

    git clone https://github.com/Ousret/charset_normalizer.git
    cd charset_normalizer/
    python setup.py install

Basic Usage
===========

The new way
-----------

You may want to get right to it. ::

    from charset_normalizer import from_bytes, from_path

    # This is going to print out your sequence once properly decoded
    print(
        str(
            from_bytes(
                my_byte_str
            ).best()
        )
    )

    # You may also want the same from a file
    print(
        str(
            from_path(
                './data/sample.1.ar.srt'
            ).best()
        )
    )

Backward compatibility
----------------------

If you are used to Python chardet, we provide the very same ``detect()`` method as chardet. This function is mostly backward-compatible with Chardet, and the migration should be painless. ::

    from charset_normalizer import detect

    # This will behave exactly the same as python chardet
    result = detect(my_byte_str)

    if result['encoding'] is not None:
        print('got', result['encoding'], 'as detected encoding')

You may upgrade your code with ease: CTRL + R ``from chardet import detect`` to ``from charset_normalizer import detect``.
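
If you want to go one step further and persist the decoded content as UTF-8 instead of printing it, a minimal sketch follows; the file names are placeholders. ::

    from charset_normalizer import from_path

    best_guess = from_path('./data/sample.1.ar.srt').best()

    if best_guess is None:
        raise ValueError('no suitable encoding found')

    # str() yields the decoded Unicode content; re-encode it however you like.
    with open('./data/sample.1.ar.utf8.srt', 'w', encoding='utf-8') as fp:
        fp.write(str(best_guess))
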
user/cli.rst

Command Line Interface
======================

charset-normalizer ships with a CLI that should be available as `normalizer`. It is a great tool to fully exploit the detector's capabilities without having to write Python code.

Possible use cases:

#. Quickly discover the probable originating charset of a file.
#. Quickly convert a non-Unicode file to Unicode.
#. Debug the charset detector.

Down below, we will guide you through some basic examples.

Arguments
---------

You may simply invoke `normalizer -h` (with the h(elp) flag) to understand the basics. ::

    usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                      file [file ...]

    The Real First Universal Charset Detector. Discover originating encoding used
    on text file. Normalize text to unicode.

    positional arguments:
      files                 File(s) to be analysed

    optional arguments:
      -h, --help            show this help message and exit
      -v, --verbose         Display complementary information about file if any.
                            Stdout will contain logs about the detection process.
      -a, --with-alternative
                            Output complementary possibilities if any. Top-level
                            JSON WILL be a list.
      -n, --normalize       Permit to normalize input file. If not set, program
                            does not write anything.
      -m, --minimal         Only output the charset detected to STDOUT. Disabling
                            JSON output.
      -r, --replace         Replace file when trying to normalize it instead of
                            creating a new one.
      -f, --force           Replace file without asking if you are sure, use this
                            flag with caution.
      -t THRESHOLD, --threshold THRESHOLD
                            Define a custom maximum amount of chaos allowed in
                            decoded content. 0. <= chaos <= 1.
      --version             Show version information and exit.

For example:

.. code:: bash

    normalizer ./data/sample.1.fr.srt

You may also run the command line interface using:

.. code:: bash

    python -m charset_normalizer ./data/sample.1.fr.srt

Main JSON Output
----------------

🎉 Since version 1.4.0, the CLI produces an easily usable stdout result in JSON format.

.. code:: json

    {
        "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
        "encoding": "cp1252",
        "encoding_aliases": [
            "1252",
            "windows_1252"
        ],
        "alternative_encodings": [
            "cp1254",
            "cp1256",
            "cp1258",
            "iso8859_14",
            "iso8859_15",
            "iso8859_16",
            "iso8859_3",
            "iso8859_9",
            "latin_1",
            "mbcs"
        ],
        "language": "French",
        "alphabets": [
            "Basic Latin",
            "Latin-1 Supplement"
        ],
        "has_sig_or_bom": false,
        "chaos": 0.149,
        "coherence": 97.152,
        "unicode_path": null,
        "is_preferred": true
    }

I recommend the `jq` command line tool to easily parse and exploit specific data from the produced JSON.

Multiple File Input
-------------------

It is possible to give multiple files to the CLI. It will produce a list instead of an object at the top level. When using `-m` (minimal output), it will instead print one result (encoding) per line.

Unicode Conversion
------------------

If you desire to convert any file to Unicode, you will need to append the flag `-n`. It will produce another file; it won't replace the original by default. The newly created file path will be declared in `unicode_path` (JSON output).
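
Because the CLI writes JSON to stdout, it is easy to drive from a script. The sketch below assumes ``normalizer`` is on your ``PATH`` and that a single input file yields a single JSON object, as shown above; the file path is a placeholder. ::

    import json
    import subprocess

    completed = subprocess.run(
        ['normalizer', './data/sample.1.fr.srt'],
        capture_output=True,
        text=True,
        check=True,
    )

    report = json.loads(completed.stdout)
    print(report['encoding'], report['language'])
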
user/support.rst

=================
Support
=================

**If you are running:**

- Python >=2.7,<3.5: Unsupported
- Python 3.5: charset-normalizer < 2.1
- Python 3.6: charset-normalizer < 3.1

Upgrade your Python interpreter as soon as possible.

-------------------
Supported Encodings
-------------------

Here is a list of the supported encodings and supported languages, as of the latest update. This list may change depending on your Python version. Charset Normalizer is able to detect any of those encodings. The list is NOT static and depends heavily on what your current CPython version ships with. See https://docs.python.org/3/library/codecs.html#standard-encodings

=============== ===============================================================================================================================
IANA Code Page  Aliases
=============== ===============================================================================================================================
ascii           646, ansi_x3.4_1968, ansi_x3_4_1968, ansi_x3.4_1986, cp367, csascii, ibm367, iso646_us, iso_646.irv_1991, iso_ir_6, us, us_ascii
big5            big5_tw, csbig5, x_mac_trad_chinese
big5hkscs       big5_hkscs, hkscs
cp037           037, csibm037, ebcdic_cp_ca, ebcdic_cp_nl, ebcdic_cp_us, ebcdic_cp_wt, ibm037, ibm039
cp1026          1026, csibm1026, ibm1026
cp1125          1125, ibm1125, cp866u, ruscii
cp1140          1140, ibm1140
cp1250          1250, windows_1250
cp1251          1251, windows_1251
cp1252          1252, windows_1252
cp1253          1253, windows_1253
cp1254          1254, windows_1254
cp1255          1255, windows_1255
cp1256          1256, windows_1256
cp1257          1257, windows_1257
cp1258          1258, windows_1258
cp273           273, ibm273, csibm273
cp424           424, csibm424, ebcdic_cp_he, ibm424
cp437           437, cspc8codepage437, ibm437
cp500           500, csibm500, ebcdic_cp_be, ebcdic_cp_ch, ibm500
cp775           775, cspc775baltic, ibm775
cp850           850, cspc850multilingual, ibm850
cp852           852, cspcp852, ibm852
cp855           855, csibm855, ibm855
cp857           857, csibm857, ibm857
cp858           858, csibm858, ibm858
cp860           860, csibm860, ibm860
cp861           861, cp_is, csibm861, ibm861
cp862           862, cspc862latinhebrew, ibm862
cp863           863, csibm863, ibm863
cp864           864, csibm864, ibm864
cp865           865, csibm865, ibm865
cp866           866, csibm866, ibm866
cp869           869, cp_gr, csibm869, ibm869
cp932           932, ms932, mskanji, ms_kanji
cp949           949, ms949, uhc
cp950           950, ms950
euc_jis_2004    jisx0213, eucjis2004, euc_jis2004
euc_jisx0213    eucjisx0213
euc_jp          eucjp, ujis, u_jis
euc_kr          euckr, korean, ksc5601, ks_c_5601, ks_c_5601_1987, ksx1001, ks_x_1001, x_mac_korean
gb18030         gb18030_2000
gb2312          chinese, csiso58gb231280, euc_cn, euccn, eucgb2312_cn, gb2312_1980, gb2312_80, iso_ir_58, x_mac_simp_chinese
gbk             936, cp936, ms936
hp_roman8       roman8, r8, csHPRoman8
hz              hzgb, hz_gb, hz_gb_2312
iso2022_jp      csiso2022jp, iso2022jp, iso_2022_jp
iso2022_jp_1    iso2022jp_1, iso_2022_jp_1
iso2022_jp_2    iso2022jp_2, iso_2022_jp_2
iso2022_jp_3    iso2022jp_3, iso_2022_jp_3
iso2022_jp_ext  iso2022jp_ext, iso_2022_jp_ext
iso2022_kr      csiso2022kr, iso2022kr, iso_2022_kr
iso8859_10      csisolatin6, iso_8859_10, iso_8859_10_1992, iso_ir_157, l6, latin6
iso8859_11      thai, iso_8859_11, iso_8859_11_2001
iso8859_13      iso_8859_13, l7, latin7
iso8859_14      iso_8859_14, iso_8859_14_1998, iso_celtic, iso_ir_199, l8, latin8
iso8859_15      iso_8859_15, l9, latin9
iso8859_16      iso_8859_16, iso_8859_16_2001, iso_ir_226, l10, latin10
iso8859_2       csisolatin2, iso_8859_2, iso_8859_2_1987, iso_ir_101, l2, latin2
iso8859_3       csisolatin3, iso_8859_3, iso_8859_3_1988, iso_ir_109, l3, latin3
iso8859_4       csisolatin4, iso_8859_4, iso_8859_4_1988, iso_ir_110, l4, latin4
iso8859_5       csisolatincyrillic, cyrillic, iso_8859_5, iso_8859_5_1988, iso_ir_144
iso8859_6       arabic, asmo_708, csisolatinarabic, ecma_114, iso_8859_6, iso_8859_6_1987, iso_ir_127
iso8859_7       csisolatingreek, ecma_118, elot_928, greek, greek8, iso_8859_7, iso_8859_7_1987, iso_ir_126
iso8859_8       csisolatinhebrew, hebrew, iso_8859_8, iso_8859_8_1988, iso_ir_138
iso8859_9       csisolatin5, iso_8859_9, iso_8859_9_1989, iso_ir_148, l5, latin5
iso2022_jp_2004 iso_2022_jp_2004, iso2022jp_2004
johab           cp1361, ms1361
koi8_r          cskoi8r
kz1048          kz_1048, rk1048, strk1048_2002
latin_1         8859, cp819, csisolatin1, ibm819, iso8859, iso8859_1, iso_8859_1, iso_8859_1_1987, iso_ir_100, l1, latin, latin1
mac_cyrillic    maccyrillic
mac_greek       macgreek
mac_iceland     maciceland
mac_latin2      maccentraleurope, maclatin2
mac_roman       macintosh, macroman
mac_turkish     macturkish
ptcp154         csptcp154, pt154, cp154, cyrillic_asian
shift_jis       csshiftjis, shiftjis, sjis, s_jis, x_mac_japanese
shift_jis_2004  shiftjis2004, sjis_2004, s_jis_2004
shift_jisx0213  shiftjisx0213, sjisx0213, s_jisx0213
tis_620         tis620, tis_620_0, tis_620_2529_0, tis_620_2529_1, iso_ir_166
utf_16          u16, utf16
utf_16_be       unicodebigunmarked, utf_16be
utf_16_le       unicodelittleunmarked, utf_16le
utf_32          u32, utf32
utf_32_be       utf_32be
utf_32_le       utf_32le
utf_8           u8, utf, utf8, utf8_ucs2, utf8_ucs4 (+utf_8_sig)
utf_7*          u7, unicode-1-1-utf-7
cp720           N.A.
cp737           N.A.
cp856           N.A.
cp874           N.A.
cp875           N.A.
cp1006          N.A.
koi8_r          N.A.
koi8_t          N.A.
koi8_u          N.A.
=============== ===============================================================================================================================

*: Only if a SIG/mark is found.
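
Since this table ultimately mirrors what your CPython build ships with, you can check locally whether a given code page (or one of its aliases) is available. The sketch below is plain standard-library ``codecs`` usage, not a charset-normalizer API. ::

    import codecs

    for candidate in ('cp1252', 'windows_1252', 'iso8859_16', 'not_a_codec'):
        try:
            info = codecs.lookup(candidate)
        except LookupError:
            print(candidate, '-> not available on this interpreter')
        else:
            print(candidate, '-> resolves to', info.name)
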
-------------------
Supported Languages
-------------------

These languages can be detected inside your content. All of them are specified in ./charset_normalizer/assets/__init__.py .

| English,
| German,
| French,
| Dutch,
| Italian,
| Polish,
| Spanish,
| Russian,
| Japanese,
| Portuguese,
| Swedish,
| Chinese,
| Ukrainian,
| Norwegian,
| Finnish,
| Vietnamese,
| Czech,
| Hungarian,
| Korean,
| Indonesian,
| Turkish,
| Romanian,
| Farsi,
| Arabic,
| Danish,
| Serbian,
| Lithuanian,
| Slovene,
| Slovak,
| Malay,
| Hebrew,
| Bulgarian,
| Croatian,
| Hindi,
| Estonian,
| Thai,
| Greek,
| Tamil.

----------------------------
Incomplete Sequence / Stream
----------------------------

It is not (yet) officially supported. If you feed an incomplete byte sequence (e.g. a truncated multi-byte sequence), the detector will most likely fail to return a proper result. If you are purposely feeding only part of your payload for performance reasons, you may stop doing so: this package is fairly well optimized. We are working on a dedicated way to handle streams.

user/miscellaneous.rst

==============
Miscellaneous
==============

Convert to str
--------------

Any ``CharsetMatch`` object can be transformed into an exploitable ``str`` variable. ::

    my_byte_str = 'Bсеки човек има право на образование.'.encode('cp1251')

    # Assign the return value so we can fully exploit the result
    result = from_bytes(
        my_byte_str
    ).best()

    # This should print 'Bсеки човек има право на образование.'
    print(str(result))

Logging
-------

Prior to version 2.0.11, you may encounter some unexpected log entries in your streams. Something along the lines of ::

    ... | WARNING | override steps (5) and chunk_size (512) as content does not fit (465 byte(s) given) parameters.
    ... | INFO | ascii passed initial chaos probing. Mean measured chaos is 0.000000 %
    ... | INFO | ascii should target any language(s) of ['Latin Based']

This is most likely because you altered the root getLogger instance. The package has its own logic behind logging, and it is useful. See https://docs.python.org/3/howto/logging.html to learn the basics.

If you are looking to silence the logs and/or reduce their amount drastically, please upgrade to the latest version available for `charset-normalizer` using your package manager or via `pip install charset-normalizer -U`. The latest version will no longer produce any entry above `DEBUG`. On `DEBUG`, only one entry will be observed, and it concerns the detection result. The other log entries are pushed at `Level 5`, commonly known as the TRACE level, which we do not register globally.
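
If upgrading is not an option right away and you simply want those records silenced, you can raise the level of the package logger with the standard ``logging`` module. The logger name below follows the usual convention of matching the package name; treat it as an assumption and adjust it if your version differs. ::

    import logging

    # Drop every record below CRITICAL emitted by charset-normalizer.
    logging.getLogger('charset_normalizer').setLevel(logging.CRITICAL)
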
Detect binaries
---------------

This package offers a neat way to detect files that can be considered 'binaries', meaning that they are unlikely to be text files. ::

    from charset_normalizer import is_binary

    # It can receive a path, bytes, or even a file pointer.
    result = is_binary("./my-file.ext")

    # This should print 'True' or 'False'
    print(result)

community/faq.rst

Frequently asked questions
===========================

Is UTF-8 everywhere already?
----------------------------

Not really; that is a dangerous assumption. Looking at https://w3techs.com/technologies/overview/character_encoding may make it seem like encoding detection is a thing of the past, but it is not. Based on just 33k websites, you will find 3.4k responses without a predefined encoding, and 1.8k websites — merely half of those — were not UTF-8. This statistic (w3techs) is not weighted, so one should not read it as "I have a 97 % chance of hitting UTF-8 content in HTML". (2021 top 1000 sites from 80 countries in the world, according to Data for SEO.) https://github.com/potiuk/test-charset-normalizer

First of all, neither requests, chardet nor charset-normalizer is dedicated to HTML content. Detection concerns every text document, like SubRip subtitle files for instance. In my own experience, I have never had a single database using full UTF-8; many translated subtitles are from another era and were never updated. It is very hard to find any statistics at all regarding this matter. Usage patterns can be very dispersed, so making assumptions is unwise.

The real debate is whether detection is an HTTP-client matter or not. That is more complicated and not my field.

Some individuals keep insisting that the *whole* Internet is UTF-8 ready. They are absolutely wrong, and very Europe- and North America-centred. In my humble experience, countries around the world are at very different stages of this evolution, and the Internet is not just about HTML content. Producing a thorough analysis of this is quite daunting.

Should I bother using detection?
--------------------------------

As a last resort, yes. You should use well-established standards, e.g. a predefined encoding, at all times. When you are left with no clue, you may use the detector to produce a usable output as fast as possible.

Is it backward-compatible with Chardet?
---------------------------------------

If you use the legacy `detect` function, then this change is mostly backward-compatible, with a couple of caveats:

- This new library supports far more code pages (roughly three times as many) than its counterpart Chardet.
- Based on the thirty-ish charsets that Chardet supports, expect roughly 80 % backward-compatible results.

We do not guarantee this exact percentage over time. It may vary, but not by much.

Isn't it the same as Chardet?
-----------------------------

The objective is the same: provide you with the best answer (charset) we can, given any sequence of bytes. The method actually differs: we do not "train" anything to build a probe for a specific encoding. In addition to identifying languages through some rudimentary statistics (character frequency ordering), we built a mess detector to assist the language detection. Any code page supported by your CPython is supported by charset-normalizer! It is that simple; there is no need to update the library. It is as generic as we could make it.

I can't build a standalone executable
-------------------------------------

If you are using ``pyinstaller``, ``py2exe`` or alike, you may be encountering this, or something close to it ::

    ModuleNotFoundError: No module named 'charset_normalizer.md__mypyc'

Why?

- Your package manager picked up an optimized (for speed purposes) wheel that matches your architecture and operating system.
- The module ``charset_normalizer.md__mypyc`` is imported via a compiled binary and cannot be seen by your bundling tool.

How to remedy it? If your bundler program supports it, set up a hook that implicitly imports the hidden module, as sketched below. Otherwise, follow the guide on how to install the vanilla version of this package (section: *Optional speedup extension*).
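
With PyInstaller, for example, one common remedy is a small hook file that declares the hidden module. Treat the snippet below as a sketch: the hook file name and the ``--additional-hooks-dir`` mechanism come from PyInstaller's documentation, so double-check them against your bundler's version. ::

    # hook-charset_normalizer.py -- place it in a directory passed to
    # PyInstaller via --additional-hooks-dir (or reference it from your .spec).
    hiddenimports = ['charset_normalizer.md__mypyc']
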
community/why_migrate.rst

Why should I migrate to Charset-Normalizer?
===========================================

There are many reasons to migrate your current project. Here are some of them:

- Removes ANY license ambiguity/restriction for projects bundling Chardet (even indirectly).
- On average 10x faster than Chardet, and 6x faster in 99 % of cases, AND supports three times more encodings.
- Never returns an encoding that is unsuitable for the given decoder, so you never get a UnicodeDecodeError!
- Actively maintained and open to contributors.
- Provides the backward-compatible ``detect`` function that comes from Chardet.
- Truly detects the language used in the text.
- It is, for the first time, really universal, as there is no specific probe per charset.
- The package size is 2x~4x smaller than Chardet's (5.0)! (Depends on your architecture.)
- Offers many more options/public kwargs to tweak the detection as you see fit!
- Uses static typing to ease your development.
- Detects Unicode content better than Chardet or cChardet does.

And much more! What are you waiting for? Upgrade now and give us your feedback (even if negative).

community/speedup.rst

Optional speedup extension
==========================

Why?
----

charset-normalizer will always remain pure Python, meaning that an environment without any build capabilities will run this program without any additional requirements. Nonetheless, starting from version 3.0 we introduce and publish some platform-specific wheels that include a pre-built extension.

Most of the time is spent in the module `md.py`, so we decided to "compile" it using Mypyc.

(1) It does not require a separate code base
(2) Our project code base is rather simple and lightweight
(3) Mypyc is robust enough today
(4) Four times faster!

How?
----

If your platform and/or architecture is not served by this swift optimization, you may compile it easily yourself, following these instructions (provided you have the necessary toolchain installed) ::

    export CHARSET_NORMALIZER_USE_MYPYC=1
    pip install mypy build wheel
    pip install charset-normalizer --no-binary :all:

How not to?
-----------

You may install charset-normalizer without the speedups by directly using the universal wheel (most likely hosted on PyPI or any valid mirror you use) with ``--no-binary``.

E.g. when installing ``requests`` and you don't want to use the ``charset-normalizer`` speedups, you can do ::

    pip install requests --no-binary charset-normalizer

When installing `charset-normalizer` by itself, you can also pass ``:all:`` as the specifier to ``--no-binary``. ::

    pip install charset-normalizer --no-binary :all:

community/featured.rst

Featured projects
=================

Did you like how ``charset-normalizer`` performs, and its quality? You may be interested in these other projects maintained by the same authors. We aim to serve the open-source community the best and as inclusively as we can, no matter your level or opinions.

Niquests
--------

It started as a simple thought: IE 11 has built-in HTTP/2 support while Requests 2.32 does not! Most of our programs that interact with HTTP servers are built with ``requests``, and we aren't likely to switch without a substantial effort. Yet we could be gone at any moment, without any notice, knowing that as Python developers we never interacted with an HTTP/2 (over TCP) or HTTP/3 (over QUIC) capable server in 2023...

.. image:: https://dabuttonfactory.com/button.png?t=Get+Niquests+Now&f=Ubuntu-Bold&ts=26&tc=fff&hp=45&vp=20&c=11&bgt=unicolored&bgc=15d798&be=1
   :target: https://github.com/jawah/niquests

It is a fork of ``requests``, and no breaking changes are to be expected. We made sure that your migration is effortless and safe.

httpie-next
-----------

Easy solutions are cool, so let us introduce you to HTTPie with built-in support for HTTP/2 and HTTP/3. It is made available as a plugin; no effort is required besides installing said plugin. Enjoy a refreshed HTTPie!

.. image:: https://dabuttonfactory.com/button.png?t=Get+HTTPie-Next+Now&f=Ubuntu-Bold&ts=26&tc=fff&hp=45&vp=20&c=11&bgt=unicolored&bgc=15d798&be=1
   :target: https://github.com/Ousret/httpie-next

Wassima
-------

Did you ever wonder what it would feel like to leave behind the headache of root CAs (certificate authorities)? Well, starting today, you may use your operating system's trusted root CAs to verify peer certificates with the utmost comfort. It is enabled by default in Niquests, but you can also use this awesome feature by itself.

.. image:: https://dabuttonfactory.com/button.png?t=OS+root+CAs+for+Python&f=Ubuntu-Bold&ts=26&tc=fff&hp=45&vp=20&c=11&bgt=unicolored&bgc=15d798&be=1
   :target: https://github.com/jawah/wassima

The solution is universal and serves (almost) every possible case. You may remove the certifi package; let it rest in peace.

Makefile

# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = python -msphinx
SPHINXPROJ    = Charset Normalizer
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

make.bat

@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=build
set SPHINXPROJ=charset_normalizer

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd