This document is a tutorial on mastering Python 3 I/O presented by David Beazley at PyCon 2010. It discusses how Python 3 reimplemented the entire I/O stack, introducing new programming idioms. The tutorial will provide a top-to-bottom tour of the Python 3 I/O system, including text handling, binary data, system interfaces, the new I/O stack, standard library issues, and memory views. Prior experience with Python 3 is not required, but familiarity with Python 2 I/O is assumed.
The document discusses a Python tutorial presentation on systems programming. It describes building programs to analyze Firefox browser cache files, including a findcache.py program that recursively searches a directory for Firefox cache folders. The goal is to demonstrate Python for practical system tasks like file parsing and processing. Disclaimers note the code only uses standard Python and is intended for educational purposes.
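The directory scan at the heart of a findcache.py-style program can be sketched in a few lines (a minimal sketch; the folder name "Cache" matched here is an assumption, not necessarily what the tutorial's program looks for):

```python
import os

def find_cache_dirs(root, name="Cache"):
    """Recursively yield directories under root whose basename matches name."""
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            if d == name:
                yield os.path.join(dirpath, d)
```

Because the function is a generator, matches can be processed as they are found rather than after the whole tree has been walked.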
The document discusses the Global Interpreter Lock (GIL) in Python and how it can limit thread performance. An experiment shows that dividing a CPU-bound task across two threads results in slower performance than running it sequentially. The talk will examine threads and the GIL in depth, including how thread switching works and what can go wrong, in an effort to explain this mystery. Visualizations of GIL logging data reveal interesting thread scheduling behaviors on single and multicore systems.
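The experiment described can be reproduced with a sketch like the following (the workload size is illustrative; actual timings depend on the machine and Python version):

```python
import threading
import time

def countdown(n):
    # Pure-Python CPU-bound work; the running thread holds the GIL throughout.
    while n > 0:
        n -= 1

N = 2_000_000

# Sequential: one thread does all the work, twice.
start = time.perf_counter()
countdown(N)
countdown(N)
seq = time.perf_counter() - start

# Threaded: the same total work split across two threads.
t1 = threading.Thread(target=countdown, args=(N,))
t2 = threading.Thread(target=countdown, args=(N,))
start = time.perf_counter()
t1.start(); t2.start()
t1.join(); t2.join()
par = time.perf_counter() - start

print(f"sequential: {seq:.2f}s  threaded: {par:.2f}s")
```

On classic CPython the threaded version is typically no faster than the sequential one, and on multicore machines it is often slower, because only the thread holding the GIL can execute Python bytecode.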
The document is a presentation on Python given at the USENIX LISA conference in 2007. It introduces Python by discussing its history, influences, uses, and focus on systems programming. It provides an overview of getting started with Python, including where to get it, running Python in interactive mode or from files, and creating Python programs. It also includes a sample mortgage calculation program written in Python as an example.
This document provides an overview of a tutorial on mastering Python 3 I/O. The tutorial will cover the reimplemented I/O system in Python 3, including text handling, formatting, binary data handling, the new I/O stack, system interfaces, and library design issues. It assumes some familiarity with how I/O works in Python 2 and will take a detailed tour of the changes in Python 3.
The document describes using generators to process large log files. A non-generator solution would open the log file and iterate through it line-by-line, splitting each line to extract the byte value and add it to a running total. However, this requires keeping the entire file contents in memory. A generator-based solution yields lines from the file one at a time to avoid loading the entire file into memory at once.
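The generator-based version can be sketched as a small pipeline (the Apache-style log format, with the byte count as the last field and '-' for missing values, is assumed):

```python
def total_bytes(logfile):
    """Sum the bytes-transferred column (the last field) of an access log."""
    with open(logfile) as wwwlog:
        # Generator pipeline: the file is consumed one line at a time,
        # so the whole log never sits in memory.
        bytecolumn = (line.rsplit(None, 1)[-1] for line in wwwlog)
        return sum(int(x) for x in bytecolumn if x != '-')
```

The file object itself is a lazy iterator over lines, so chaining generator expressions over it keeps memory use constant regardless of log size.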
Why Extension Programmers Should Stop Worrying About Parsing and Start Thinki... - David Beazley (Dabeaz LLC)
The document discusses the challenges with automatically generating extension wrappers from header file parsing alone. While parsing provides a convenient interface, extension building is fundamentally about linking two type systems together in a meaningful way. The document argues that parsing focuses on the wrong problem, as header files do not contain enough semantic information to fully understand C++ types, templates, and relationships. The document provides examples of cases that would break a parser-only approach.
This document provides an introduction to concurrency in Python using threads. It discusses how threads allow programs to perform multiple tasks simultaneously by sharing system resources like memory. The document covers basic threading concepts like creating and launching threads, as well as challenges like accessing shared data between threads, which can be non-deterministic due to thread scheduling. It aims to provide an overview of concurrency support in the Python standard library beyond just the user manual.
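The shared-data hazard can be made concrete with the classic shared-counter sketch (without the lock, `counter += 1` is a read-modify-write that threads can interleave, making the final count unpredictable):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # serializes the read-modify-write on counter
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; unpredictable without it
```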
The document discusses Python generator functions and expressions. It gives examples of how a generator function can yield values iteratively, as in a countdown, and of how generator expressions resemble list comprehensions but produce values one at a time instead of building a list. It also shows generators applied to data files, such as summing the bytes transferred across the entries of an Apache web server log.
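Both ideas fit in a short sketch:

```python
def countdown(n):
    """Generator function: lazily produces n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1

print(list(countdown(5)))   # [5, 4, 3, 2, 1]

# Generator expression: like a list comprehension, but values are
# produced on demand instead of being materialized into a list.
squares = (x * x for x in range(5))
print(sum(squares))         # 30
```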
The document describes a tutorial on generator tricks for systems programmers given by David Beazley. It introduces generators and how they can be used to process large data files like web server logs more efficiently by generating values one at a time instead of building large lists. Specifically, it presents a problem of summing the last column of a web log to find the total bytes transferred, and suggests using a generator to process the log line by line due to potentially large file sizes.
The document discusses the SWIG tool, which allows C/C++ code to be integrated with scripting languages like Perl, Python, and Tcl. SWIG works by taking C/C++ header files, wrapping the code in a wrapper module, and compiling it into an extension that can then be used from the scripting language. While SWIG handles many common C/C++ features automatically, interface building can sometimes be challenging due to issues like pointer ambiguity, preprocessor macros, or advanced C++ features that SWIG may not fully support.
This document discusses a presentation on advanced uses of generators in Python. It covers context managers, which allow entry and exit actions for code blocks using the 'with' statement. Generators can be used to implement context managers via the yield keyword and decorator contextmanager. This transforms generator functions into objects that support the required __enter__ and __exit__ methods to monitor code block execution. The presentation aims to expand understanding of generators beyond iteration by exploring this technique for simplifying resource management tasks like file handling through context managers.
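The technique can be sketched with a generator-based re-creation of file opening (`opened` is a hypothetical name for illustration; `contextlib.contextmanager` supplies the `__enter__`/`__exit__` machinery):

```python
from contextlib import contextmanager

@contextmanager
def opened(path, mode="r"):
    """A file-opening context manager built from a generator.

    Code before the yield runs on entry (__enter__); the finally
    block after it runs on exit (__exit__), even if the body raises.
    """
    f = open(path, mode)
    try:
        yield f
    finally:
        f.close()
```

Usage mirrors the built-in `open`: `with opened("data.txt", "w") as f: f.write(...)`, and the file is closed when the block ends.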
This document discusses SWIG (Simplified Wrapper and Interface Generator), which is a tool that takes C/C++ declarations as input and generates bindings to other languages like Python, Tcl, Perl, and Guile. SWIG allows functions, variables, constants, and C++ classes to be accessed from these scripting languages. It handles data type conversions and run-time type checking. The document provides examples of using SWIG to expose a simple C function and C++ class to Python.
SWIG is a tool that can compile C/C++ code into bindings that allow the code to be accessed from scripting languages like Tcl, Python, Perl and others. It supports most C/C++ data types, simple C++ classes, runtime type checking, multiple files and modules, and automatically generates documentation. SWIG allows for rapid prototyping by wrapping large amounts of code with little development time needed. It also enables building modular and reusable applications by creating language-independent components.
The OSI Superboard II was the computer on which I first learned to program back in 1979. Python is why programming remains fun today. In this tale of old meets new, I describe how I have used Python 3 to create a cloud computing service for my still-working Superboard--a problem complicated by it only having 8KB of RAM and 300-baud cassette tape audio ports for I/O.
The Common Debian Build System (CDBS) provides makefile fragments to simplify Debian package building. It includes classes that handle common build systems like autotools and distutils. CDBS aims to standardize packaging and automatically implement policy changes. While it helps most packages, it may be unnecessary for simple packages or those with unusual build systems. CDBS maintenance involves addressing bugs, feature requests, outdated documentation, and ensuring compatibility with Debian subpolicies.
The PyConTW (http://tw.pycon.org) organizers wish to improve the quality and quantity of the programming communities in Taiwan. Though Python is their core tool and methodology, they know it is worthwhile to learn from and communicate with a wide range of communities. Understanding the culture and ecosystem of a language takes me about three to six months. This six-hour course wraps up what I, an experienced Java developer, have learned from the Python ecosystem and from the agendas of past PyConTW conferences.
The Chinese-language content is available at the following link:
http://www.codedata.com.tw/python/python-tutorial-the-1st-class-1-preface
This was presented as a tutorial session at KOSS Lab. Conference 2016.
link: https://kosscon.kr/program/tutorial#11
TensorFlow from Zero
Zero Starting Life in Artificial Intelligence
TensorFlow is an open-source project released by Google.
For people who are interested in machine learning but not familiar with IT,
this tutorial starts from setting up a virtual environment, installs TensorFlow,
and works through example code.
I hope it serves, in however small a way, as a primer for people starting to study machine learning.
This document describes a karaoke-style read-aloud system that uses speech alignment and text-to-speech technology. It involves using a text-to-speech API to generate an audio file from text, then aligning the audio with the text using hidden Markov model tools (HTK) to create a timed text file. This allows highlighting text as it is read like a karaoke system and has applications for language learning by allowing shadowing of speech. The process involves text preprocessing, audio generation and processing, phonetic transcription, forced alignment with HTK, and output of a timed text file.
This document provides an overview of using Python for bioinformatics. It discusses why Python is useful for bioinformatics due to its built-in libraries and wide scientific use. It also covers Python basics like strings, regular expressions, control structures, lists, dictionaries, reading/writing files, and using GitHub for code sharing. Examples are given for many of these topics. Finally, it poses questions about analyzing sequence data and a protein database using Python.
HOW 2019: A complete reproducible ROOT environment in under 5 minutes - Henry Schreiner
The document discusses setting up a ROOT environment using Conda in under 5 minutes. It describes downloading and installing Miniconda and then using Conda commands to create a new environment and install ROOT and its dependencies from the conda-forge channel. The ROOT package provides full ROOT functionality, including compilation and graphics, and supports Linux, macOS, and multiple Python versions.
Sally Kleinfeldt and Aaron VanDerlip describe ore.bigfile, a minimalist solution to the problem of uploading, downloading, and versioning very large files in Plone.
The document introduces SD, a peer-to-peer bug tracking tool developed by Best Practical to allow tracking bugs offline and syncing work across devices. SD uses a decentralized model in which each installation can pull changes from any other replica. It supports syncing with other bug trackers such as RT, Trac, and Google Code. The author argues that cloud services make users dependent, while SD enables fully offline and distributed work, syncing the way users naturally share files.
Beating the (sh** out of the) GIL - Multithreading vs. Multiprocessing - Guy K. Kloss
Talk given at the June 2008 meeting of the New Zealand Python User Group in Auckland.
Outline: An overview of approaches to parallel/concurrent programming in Python.
Code demonstrated in the presentation can be found here:
http://www.kloss-familie.de/moin/TalksPresentations
FILEgrain: Transport-Agnostic, Fine-Grained Content-Addressable Container Ima... - Akihiro Suda
My talk at Open Source Summit North America (Los Angeles - September 11, 2017): http://sched.co/BDpM
---
The current Docker/OCI image format represents rootfs layers as TAR archives, one per Dockerfile `RUN` changeset.
One problem with this format is that a container cannot be started until all of the TAR archives have been downloaded.
The format also limits download concurrency and the granularity of file deduplication across different versions of an image.
FILEgrain solves these problems by using a content-addressable store at the granularity of individual files, rather than TAR archives, in a transport-agnostic way.
Since files can be downloaded lazily, a container can be started without downloading the whole image.
An experiment with a 633 MB Java image shows that downloading 4 MB of files is enough to run sh, 87 MB for the JRE, and 136 MB for the JDK.
Further information is available at https://github.com/AkihiroSuda/filegrain .
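The core idea, storing each file under the hash of its contents so that identical files deduplicate across image versions, can be sketched in a few lines (this Python sketch only illustrates content addressing; it is not FILEgrain's actual on-disk format, which is specified at the GitHub link above):

```python
import hashlib
import os

def store_file(path, store_dir):
    """Copy path into store_dir under the SHA-256 of its contents.

    Identical files map to the same blob name, so each distinct
    content is stored exactly once.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    digest = h.hexdigest()
    blob = os.path.join(store_dir, digest)
    if not os.path.exists(blob):        # dedup: already-stored content is skipped
        with open(path, "rb") as src, open(blob, "wb") as dst:
            dst.write(src.read())
    return digest
```

Because blobs are named by content rather than by archive, two image versions that share most of their files share most of their blobs.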
Sphinx autodoc - automated api documentation - PyCon.KR 2015 - Takayuki Shimizukawa
Using the automated documentation features of Sphinx, you can easily produce extensive documentation for a Python program.
You just write documentation for your Python functions (docstrings); Sphinx organizes it into a document that can be converted to a variety of formats.
In this session, I'll explain a documentation procedure that uses the Sphinx autodoc and autosummary extensions.
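The workflow rests on ordinary docstrings; a module like the following sketch is all autodoc needs (the function is a hypothetical example, and the `.. autofunction::` directive shown in the comment assumes `sphinx.ext.autodoc` is enabled in conf.py):

```python
# A directive such as ".. autofunction:: mymodule.add" in an .rst
# file pulls this docstring into the generated documentation.

def add(a, b):
    """Return the sum of *a* and *b*.

    :param a: first addend
    :param b: second addend
    :returns: a + b
    """
    return a + b
```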
A talk given at the August 2010 meeting of the Linux Users of Victoria. It uses the group's mailing list, some 20,000 messages since the start of 2007 containing over 2 million words, as a demonstration of building a web corpus with NLTK (Natural Language Toolkit), the Python library.
This document discusses using Debian and Taverna to facilitate community-driven computational biology. Debian can provide thousands of bioinformatics packages and tools as well as a supportive community. Taverna allows connecting local tools, web services, and cloud instances into workflows. Together Debian and Taverna can help share bioinformatics recipes, integrate public data, and foster collaboration on cloud images and software packages.
Python is a high-level, general-purpose programming language that emphasizes code readability. It supports multiple programming paradigms including object-oriented, imperative and functional programming. Python code is typically shorter than equivalent Java or C++ code, due to features like automatic memory management and its use of indentation for code blocks rather than curly braces. While Python code runs slower than compiled languages, its rapid development cycle makes it productive for prototyping and small-to-medium sized projects.
The document discusses techniques, challenges, and best practices for handling input/output (I/O) operations in Java. It covers the different types of I/O, how Java supports I/O through streams and readers/writers, issues with streams, alternatives like NIO that support non-blocking I/O using buffers and channels, and "Hiranya's Laws" with guidelines for proper I/O handling.
This document provides an introduction to the Python programming language. It discusses why Python is used, what Python can be used for, its technical strengths, and its few downsides. It also provides instructions on installing Python and running a simple "Hello World" program. The key points are that Python is readable, maintainable, and has a small code size; it can be used for systems programming, GUIs, scripting, databases, and more; and its main downside is potential slower execution speed compared to compiled languages like C and C++.
Building cloud-enabled genomics workflows with Luigi and Docker - Jacob Feala
Talk given at Bio-IT 2016, Cloud Computing track
Abstract:
As bioinformatics scientists, we tend to write custom tools for managing our workflows, even when viable, open-source alternatives are available from the tech community. Our field has, however, begun to adopt Docker containers to stabilize compute environments. In this talk, I will introduce Luigi, a workflow system built by engineers at Spotify to manage long-running big data processing jobs with complex dependencies. Focusing on a case study of next generation sequencing analysis in cancer genomics research, I will show how Luigi can connect simple, containerized applications into complex bioinformatics pipelines that can be easily integrated with compute, storage, and data warehousing on the cloud.
BP107: Ten Lines Or Less: Interesting Things You Can Do In Java With Minimal ... - panagenda
Don’t be afraid of Java! Many IBM Notes/Domino developers, both new and seasoned, have an irrational fear of learning and using Java because it seems overwhelming. Julian and Kathy will help you over this stumbling block with several short, understandable, and useful examples of Java that you can learn from. All of the examples will be ten lines of code or less, making them approachable and easy to understand. And we will show you how to integrate the Java code with an XPages application so you can get started right away.
10 Lines or Less; Interesting Things You Can Do In Java With Minimal Code - Kathy Brown
This document provides 10 lines of Java code examples for working with files and images in IBM Notes/Domino. It discusses reading and writing files using different methods like FileChannel and BufferedReader. It also demonstrates how to create a thumbnail image from a file attachment and embed it in a rich text field as a MIME object rather than a file attachment. The document emphasizes using try/finally blocks to properly close streams and considers server permissions and memory usage implications.
Version control, like Git, allows developers to track changes made to code and assets over time by saving revisions to a remote server or repository. This allows for easy collaboration, reverting mistakes, taking risks with new features, and preventing work from being lost due to hardware failures. The document recommends using a distributed version control system like Git for game development projects and outlines best practices for setting it up with Unity.
This is a Python course for beginners, intended both for frontal classroom teaching and for self-study.
The course is designed for two days of instruction followed by another week of homework assignments.
This is the presentation I gave about Python 3.5 to my research group. My intention was to introduce the Python language to new members who have little or no knowledge of it.
Duke Docker Day 2014: Research Applications with Docker - Darin London
This talk was presented at the first annual Duke Docker Day, presented by the Duke Office of Information Technology. It describes a reproducible analysis pipeline built from Docker images.
DEVNET-2003 Coding 203: Python - User Input, File I/O, Logging and REST API C... - Cisco DevNet
This hands-on session will build on your Python skills and take you deeper into the language to explore input, output, and interacting with a RESTful API. Bring your laptop and join in the coding. If you would like to code along during the session, follow the "How To Setup Your Own Computer" section at the top of this learning lab before you come to the session: https://learninglabs.cisco.com/#/labs/coding-102-rest-python/step/1
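The session's file I/O and logging topics combine in a sketch like this (the function, filenames, and data shape are hypothetical placeholders, not part of the Cisco lab):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
log = logging.getLogger("demo")

def load_devices(path):
    """Read a JSON file of device records, logging how many were found."""
    with open(path) as f:
        devices = json.load(f)
    log.info("loaded %d devices from %s", len(devices), path)
    return devices
```

The same pattern, read a file, log what happened, return parsed data, carries over directly to handling JSON responses from a REST API.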
This document contains an interview transcript with Bjarne Stroustrup, the creator of the C++ programming language. In the interview, Stroustrup discusses various aspects of the new C++0x standard, including its improved support for concurrency, threads, avoiding data races, and passing values between threads. He emphasizes C++'s role as a general purpose language suitable for systems programming that combines low-level machine efficiency with flexible abstraction mechanisms.
IS - section 1 - modifiedFinal information system.pptxNaglaaAbdelhady
GitHub is a version control system and cloud-based service that allows developers to store and manage code. Version control systems like Git track changes to code over time and allow for branching and merging. GitHub Desktop and Git Bash are graphical and command-line interfaces that allow interacting with GitHub locally. Files in C++ can be used to persistently store data and streams are used for input/output. The fstream class supports file input/output and ofstream and ifstream are used for output and input respectively.
Python is a popular programming language created by Guido van Rossum in 1991. It is easy to use, powerful, and versatile, making it suitable for beginners and experts alike. Python code can be written and executed in the browser using Google Colab, which provides a Jupyter notebook environment and access to computing resources like GPUs. The document then discusses installing Python using Anaconda, basic Python concepts like indentation, variables, strings, conditionals, and loops.
The document provides information about Mohan Arumugam's profile and background as a technologies specialist and consultant. It then provides details about the history and development of the Python programming language, including:
- Python is named after the Monty Python comedy group, not the snake.
- Key releases and features added in each major Python version from 0.9.0 to 3.12.
- Python's growing popularity and widespread adoption in various domains like web development, data science, AI, and automation.
- Details about Python 3.12, including new features like generic classes/functions, f-string grammar updates, and security improvements.
The document provides an overview and introduction to building custom embedded Linux systems using Yocto Project and OpenEmbedded. It discusses build host requirements, the workflow and layers model, and provides pointers to upstream documentation and resources. The summary highlights key aspects:
- Yocto Project uses a layers model where a BSP layer, core metadata layers (e.g. poky), and additional application layers are combined to build customized embedded images
- Build hosts need certain packages installed like git, tar and development tools as well as some non-standard ones like bc, lzop, and u-boot-tools
- Network issues can cause build failures so it is recommended to pre-fetch sources and share caches between
Python and Oracle : allies for best of data managementLaurent Leturgez
In this presentation, I described Python and how Python can Interact with Oracle database, and Oracle Cloud Infrastructure in various project : from data visualisation to data science.
Speed up large-scale ML/DL offline inference job with AlluxioAlluxio, Inc.
Microsoft used Alluxio to speed up large-scale machine learning inference jobs running on Azure. Alluxio helped optimize data access patterns by caching and prefetching input data while streaming output, reducing I/O stalls and improving GPU utilization. This led to inference jobs completing 18% faster compared to without Alluxio. Further work includes adding write retry handling and adopting Alluxio for training jobs which have different data access patterns.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications.He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Things to Consider When Choosing a Website Developer for your Website | FODUUFODUU
Choosing the right website developer is crucial for your business. This article covers essential factors to consider, including experience, portfolio, technical skills, communication, pricing, reputation & reviews, cost and budget considerations and post-launch support. Make an informed decision to ensure your website meets your business goals.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Building Production Ready Search Pipelines with Spark and Milvus
Mastering Python 3 I/O
1. Mastering Python 3 I/O
David Beazley
http://www.dabeaz.com
Presented at PyCon'2010
Atlanta, Georgia
Copyright (C) 2010, David Beazley, http://www.dabeaz.com 1
2. This Tutorial
• It's about a very specific aspect of Python 3
• Maybe the most important part of Python 3
• Namely, the reimplemented I/O system
3. Why I/O?
• Real programs interact with the world
• They read and write files
• They send and receive messages
• They don't compute Fibonacci numbers
• I/O is at the heart of almost everything that
Python is about (scripting, gluing, frameworks,
C extensions, etc.)
4. The I/O Problem
• Of all of the changes made in Python 3, it is
my observation that I/O handling changes are
the most problematic for porting
• Python 3 re-implements the entire I/O stack
• Python 3 introduces new programming idioms
• I/O handling issues can't be fixed by automatic
code conversion tools (2to3)
5. The Plan
• We're going to take a detailed top-to-bottom
tour of the whole Python 3 I/O system
• Text handling
• Binary data handling
• System interfaces
• The new I/O stack
• Standard library issues
• Memory views, buffers, etc.
6. Prerequisites
• I assume that you are already reasonably
familiar with how I/O works in Python 2
• str vs. unicode
• print statement
• open() and file methods
• Standard library modules
• General awareness of I/O issues
• Prior experience with Python 3 not required
7. Performance Disclosure
• There are some performance tests
• Execution environment for tests:
• 2.4 GHZ 4-Core MacPro, 3GB memory
• OS-X 10.6.2 (Snow Leopard)
• All Python interpreters compiled from
source using same config/compiler
• Tutorial is not meant to be a detailed
performance study so all results should be
viewed as rough estimates
8. Let's Get Started
• I have made a few support files:
http://www.dabeaz.com/python3io/index.html
• You can try some of the examples as we go
• However, it is fine to just watch/listen and try
things on your own later
9. Part 1
Introducing Python 3
10. Syntax Changes
• As you know, Python 3 changes syntax
• print is now a function print()
print("Hello World")
• Exception handling syntax changed slightly
try:
...
except IOError as e:
...
• Yes, your old code will break
11. Many New Features
• Python 3 introduces many new features
• Composite string formatting
"{0:10s} {1:10d} {2:10.2f}".format(name, shares, price)
• Dictionary comprehensions
a = {key.upper():value for key,value in d.items()}
• Function annotations
def square(x:int) -> int:
return x*x
• Much more... but that's a different tutorial
12. Changed Built-ins
• Many of the core built-in operations change
• Examples : range(), zip(), etc.
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> c = zip(a,b)
>>> c
<zip object at 0x100452950>
>>>
• Typically related to iterators/generators
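To illustrate the iterator behavior (a small check of my own, not from the slides): zip() in Python 3 is a one-shot iterator, so you call list() to materialize it, and it is empty after one pass.

```python
a = [1, 2, 3]
b = [4, 5, 6]
c = zip(a, b)                              # lazy iterator, not a list
assert list(c) == [(1, 4), (2, 5), (3, 6)] # materialize with list()
assert list(c) == []                       # exhausted after one pass
```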
13. Library Reorganization
• The standard library has been cleaned up
• Especially network/internet modules
• Example : Python 2
from urllib2 import urlopen
u = urlopen("http://www.python.org")
• Example : Python 3
from urllib.request import urlopen
u = urlopen("http://www.python.org")
14. 2to3 Tool
• There is a tool (2to3) that can be used to
identify (and optionally fix) Python 2 code
that must be changed to work with Python 3
• It's a command-line tool:
bash % 2to3 myprog.py
...
• Critical point : 2to3 can help, but it does not
automate Python 2 to 3 porting
15. 2to3 Example
• Consider this Python 2 program
# printlinks.py
import urllib
import sys
from HTMLParser import HTMLParser
class LinkPrinter(HTMLParser):
def handle_starttag(self,tag,attrs):
if tag == 'a':
for name,value in attrs:
if name == 'href': print value
data = urllib.urlopen(sys.argv[1]).read()
LinkPrinter().feed(data)
• It prints all <a href="..."> links on a web page
16. 2to3 Example
• Here's what happens if you run 2to3 on it
bash % 2to3 printlinks.py
...
--- printlinks.py (original)
+++ printlinks.py (refactored)
@@ -1,12 +1,12 @@
-import urllib
+import urllib.request, urllib.parse, urllib.error
 import sys
-from HTMLParser import HTMLParser
+from html.parser import HTMLParser
(It identifies lines that must be changed)
class LinkPrinter(HTMLParser):
def handle_starttag(self,tag,attrs):
if tag == 'a':
for name,value in attrs:
- if name == 'href': print value
+ if name == 'href': print(value)
...
17. Fixed Code
• Here's an example of a fixed code (after 2to3)
import urllib.request, urllib.parse, urllib.error
import sys
from html.parser import HTMLParser
class LinkPrinter(HTMLParser):
def handle_starttag(self,tag,attrs):
if tag == 'a':
for name,value in attrs:
if name == 'href': print(value)
data = urllib.request.urlopen(sys.argv[1]).read()
LinkPrinter().feed(data)
• This is syntactically correct Python 3
• But, it still doesn't work. Do you see why?
18. Broken Code
• Run it
bash % python3 printlinks.py http://www.python.org
Traceback (most recent call last):
File "printlinks.py", line 12, in <module>
LinkPrinter().feed(data)
File "/Users/beazley/Software/lib/python3.1/html/parser.py",
line 107, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
bash %
Ah ha! Look at that!
• That is an I/O handling problem
• Important lesson : 2to3 didn't find it
19. Actually Fixed Code
• This version works
import urllib.request, urllib.parse, urllib.error
import sys
from html.parser import HTMLParser
class LinkPrinter(HTMLParser):
def handle_starttag(self,tag,attrs):
if tag == 'a':
for name,value in attrs:
if name == 'href': print(value)
data = urllib.request.urlopen(sys.argv[1]).read()
LinkPrinter().feed(data.decode('utf-8'))
I added this one tiny bit (by hand)
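The underlying issue: in Python 3, urlopen().read() returns bytes, while HTMLParser.feed() expects str. A minimal sketch (the byte string below is illustrative, not actual python.org output):

```python
# urlopen().read() hands back bytes in Python 3
data = b'<a href="http://www.python.org">Python</a>'
assert isinstance(data, bytes)

# HTMLParser.feed() wants str, so decode first
text = data.decode('utf-8')
assert isinstance(text, str)
assert text == '<a href="http://www.python.org">Python</a>'
```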
20. Important Lessons
• A lot of things change in Python 3
• 2to3 only fixes really "obvious" things
• It does not, in general, fix I/O problems
• Imagine applying it to a huge framework
21. Part 2
Working with Text
22. Making Peace with Unicode
• In Python 3, all text is Unicode
• All strings are Unicode
• All text-based I/O is Unicode
• You really can't ignore it or live in denial
23. Unicode For Mortals
• I teach a lot of Python training classes
• I rarely encounter programmers who have a
solid grasp on Unicode details (or who even
care all that much about it to begin with)
• What follows : Essential details of Unicode
that all Python 3 programmers must know
• You don't have to become a Unicode expert
24. Text Representation
• Old-school programmers know about ASCII
• Each character has its own integer byte code
• Text strings are sequences of character codes
25. Unicode Characters
• Unicode is the same idea only extended
• It defines a standard integer code for every
character used in all languages (except for
fictional ones such as Klingon, Elvish, etc.)
• The numeric value is known as a "code point"
• Typically denoted U+HHHH in conversation
ñ = U+00F1
ε = U+03B5
ઇ = U+0A87
= U+3304
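These mappings are easy to check interactively (my own spot checks): ord() maps a character to its code point and chr() goes the other way.

```python
# ord() gives the code point; chr() gives the character back
assert ord('ñ') == 0x00F1
assert chr(0x00F1) == 'ñ'
assert chr(0x03B5) == 'ε'
assert hex(ord('ઇ')) == '0xa87'
```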
26. Unicode Charts
• A major problem : There are a lot of codes
• Largest supported code point U+10FFFF
• Code points are organized into charts
http://www.unicode.org/charts
• Go there and you will find charts organized by
language or topic (e.g., greek, math, music, etc.)
28. Unicode String Literals
• Strings can now contain any unicode character
• Example:
t = "That's a spicy jalapeño!"
• Problem : How do you indicate such characters?
29. Using a Unicode Editor
• If you are using a Unicode-aware editor, you can
type the characters in source code (save as UTF-8)
t = "That's a spicy Jalapeño!"
• Example : "Character & Keyboard" viewer (Mac)
30. Using Unicode Charts
• If you can't type it, use a code-point escape
t = "That's a spicy Jalapeu00f1o!"
• uxxxx - Embeds a Unicode code point in a string
31. Unicode Escapes
• There are three Unicode escapes
• \xhh : Code points U+00 - U+FF
• \uhhhh : Code points U+0100 - U+FFFF
• \Uhhhhhhhh : Code points > U+10000
• Examples:
a = "\xf1" # a = 'ñ'
b = "\u210f" # b = 'ℏ'
c = "\U0001d122" # c = '𝄢'
32. Using Unicode Charts
• Code points also have descriptive names
• \N{name} - Embeds a named character
t = "Spicy Jalape\N{LATIN SMALL LETTER N WITH TILDE}o!"
33. Commentary
• Don't overthink Unicode
• Unicode strings are mostly like ASCII strings
except that there is a greater range of codes
• Everything that you normally do with strings
(stripping, finding, splitting, etc.) still work, but
are expanded
34. A Caution
• Unicode is mostly like ASCII except when it's not
>>> s = "Jalapexf1o"
>>> t = "Jalapenu0303o"
>>> s
'Jalapeño'
'ñ' = 'n' + combining tilde (U+0303)
>>> t
'Jalapeño'
>>> s == t
False
>>> len(s), len(t)
(8, 9)
>>>
• Many tricky bits if you get into internationalization
• However, that's a different tutorial
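If you do hit this, the standard fix (an aside of mine, not covered further in the slides) is unicodedata.normalize(), which converts both strings to a canonical form before comparing:

```python
import unicodedata

s = "Jalape\xf1o"       # precomposed ñ (U+00F1)
t = "Jalapen\u0303o"    # 'n' followed by combining tilde (U+0303)
assert s != t and (len(s), len(t)) == (8, 9)

# NFC composes, NFD decomposes -- either gives a canonical form
assert unicodedata.normalize('NFC', t) == s
assert unicodedata.normalize('NFD', s) == t
```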
35. Unicode Representation
• Internally, Unicode character codes are
stored as multibyte integers (16 or 32 bits)
t = "Jalapeño"
004a 0061 006c 0061 0070 0065 00f1 006f (UCS-2,16-bits)
0000004a 00000061 0000006c 00000061 00000070 ... (UCS-4,32-bits)
• You can find out using the sys module
>>> sys.maxunicode
65535 # 16-bits
>>> sys.maxunicode
1114111 # 32-bits
• In C, it means a 'short' or 'int' is used
36. Memory Use
• Yes, text strings in Python 3 require either 2x
or 4x as much memory to store as Python 2
• For example: Read a 10MB ASCII text file
data = open("bigfile.txt").read()
>>> sys.getsizeof(data) # Python 2.6
10485784
>>> sys.getsizeof(data) # Python 3.1 (UCS-2)
20971578
>>> sys.getsizeof(data) # Python 3.1 (UCS-4)
41943100
37. Performance Impact
• Increased memory use does impact the
performance of string operations that make
copies of large substrings
• Slices, joins, split, replace, strip, etc.
• Example:
timeit("text[:-1]","text='x'*100000")
Python 2.6.4 (bytes) : 11.5 s
Python 3.1.1 (UCS-2) : 24.1 s
Python 3.1.1 (UCS-4) : 47.1 s
• There are more bytes moving around
38. Performance Impact
• Operations that process strings character by
character often run at the same speed (or are faster)
• lower, upper, find, regexs, etc.
• Example:
timeit("text.upper()","text='x'*1000")
Python 2.6.4 (bytes) : 9.3s
Python 3.1.1 (UCS-2) : 6.9s
Python 3.1.1 (UCS-4) : 7.0s
39. Commentary
• Yes, text representation has an impact
• In your programs, you can work with text in
the same way as you always have (text
representation is just an internal detail)
• However, know that the performance may
vary from 8-bit text strings in Python 2
• Study it if working with huge amounts of text
40. Issue : Text Encoding
• The internal representation of characters is now
almost never the same as how text is transmitted
or stored in files
Text File Hello World
File content 48 65 6c 6c 6f 20 57 6f 72 6c 64 0a
(ASCII bytes)
read() write()
Python String 00000048 00000065 0000006c 0000006c
Representation 0000006f 00000020 00000057 0000006f
00000072 0000006c 00000064 0000000a
(UCS-4, 32-bit ints)
41. Issue : Text Encoding
• There are also many possible file encodings
for text (especially for non-ASCII)
"Jalapeño"
latin-1 4a 61 6c 61 70 65 f1 6f
cp437 4a 61 6c 61 70 65 a4 6f
utf-8 4a 61 6c 61 70 65 c3 b1 6f
utf-16 ff fe 4a 00 61 00 6c 00 61 00
70 00 65 00 f1 00 6f 00
• Emphasize : They are only related to how
text is stored in files, not stored in memory
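The table above can be reproduced with str.encode(); a few spot checks (my own, matching the byte values shown):

```python
s = "Jalapeño"
assert s.encode('latin-1') == b'Jalape\xf1o'
assert s.encode('cp437')   == b'Jalape\xa4o'
assert s.encode('utf-8')   == b'Jalape\xc3\xb1o'
assert s.encode('utf-16').startswith(b'\xff\xfe')  # little-endian BOM
```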
42. I/O Encoding
• All text is now encoded and decoded
• If reading text, it must be decoded from its
source format into Python strings
• If writing text, it must be encoded into some
kind of well-known output format
• This is a major difference between Python 2
and Python 3. In Python 2, you could write
programs that just ignored encoding and
read text as bytes (ASCII).
43. Reading/Writing Text
• Built-in open() function has an optional
encoding parameter
f = open("somefile.txt","rt",encoding="latin-1")
• If you omit the encoding, UTF-8 is assumed
>>> f = open("somefile.txt","rt")
>>> f.encoding
'UTF-8'
>>>
• Also, in case you're wondering, text file modes
should be specified as "rt","wt","at", etc.
44. Standard I/O
• Standard I/O streams also have encoding
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>>
• Be aware that the encoding might change
depending on the locale settings
>>> import sys
>>> sys.stdout.encoding
'US-ASCII'
>>>
45. Binary File Modes
• Writing text on binary-mode files is an error
>>> f = open("foo.bin","wb")
>>> f.write("Hello Worldn")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be bytes or buffer, not str
>>>
• For binary I/O, Python 3 will never implicitly
encode unicode strings and write them
• You must either use a text-mode file or
explicitly encode (str.encode('encoding'))
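A sketch of the two workarounds (the file name is illustrative):

```python
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "foo.bin")

# Option 1: binary mode + explicit encode
with open(path, "wb") as f:
    f.write("Hello World\n".encode("utf-8"))

# Option 2: text mode -- encoding happens for you
with open(path, "wt", encoding="utf-8") as f:
    f.write("Hello World\n")

with open(path, "rb") as f:
    assert f.read() == b"Hello World\n"
```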
46. Important Encodings
• If you're not doing anything with Unicode
(e.g., just processing ASCII files), there are
still three encodings you should know
• ASCII
• Latin-1
• UTF-8
• Will briefly describe each one
47. ASCII Encoding
• Text that is restricted to 7-bit ASCII (0-127)
• Any characters outside of that range
produce an encoding error
>>> f = open("output.txt","wt",encoding="ascii")
>>> f.write("Hello Worldn")
12
>>> f.write("Spicy Jalapeñon")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode
character '\xf1' in position 12: ordinal not in
range(128)
>>>
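As an aside (not on the slide): encode() also takes an errors argument if you'd rather degrade gracefully than crash on non-ASCII input.

```python
s = "Spicy Jalapeño"
# substitute '?' for unencodable characters
assert s.encode('ascii', errors='replace') == b'Spicy Jalape?o'
# or drop them entirely
assert s.encode('ascii', errors='ignore') == b'Spicy Jalapeo'
```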
48. Latin-1 Encoding
• Text that is restricted to 8-bit bytes (0-255)
• Byte values are left "as-is"
>>> f = open("output.txt","wt",encoding="latin-1")
>>> f.write("Spicy Jalapeñon")
15
>>>
• Most closely emulates Python 2 behavior
• Also known as "iso-8859-1" encoding
• Pro tip: This is the fastest encoding for pure
8-bit text (ASCII files, etc.)
50. UTF-8 Encoding
• Main feature of UTF-8 is that ASCII is
embedded within it
• If you're never working with international
characters, UTF-8 will work transparently
• Usually a safe default to use when you're not
sure (e.g., passing Unicode strings to
operating system functions, interfacing with
foreign software, etc.)
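The "ASCII is embedded within it" claim is easy to verify (my own check):

```python
# pure ASCII text encodes to the identical bytes under UTF-8
assert "Hello".encode('utf-8') == "Hello".encode('ascii') == b"Hello"
# non-ASCII characters expand to multi-byte sequences
assert "ñ".encode('utf-8') == b'\xc3\xb1'
assert (len("ñ"), len("ñ".encode('utf-8'))) == (1, 2)
```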
51. Interlude
• If migrating from Python 2, keep in mind
• Python 3 strings use multibyte integers
• Python 3 always encodes/decodes I/O
• If you don't say anything about encoding,
Python 3 assumes UTF-8
• Everything that you did before should work
just fine in Python 3 (probably)
52. New Printing
• In Python 3, print() is used for text output
• Here is a mini porting guide
Python 2 Python 3
print x,y,z print(x,y,z)
print x,y,z, print(x,y,z,end=' ')
print >>f,x,y,z print(x,y,z,file=f)
• However, print() has a few new tricks not
available in Python 2
53. Printing Enhancements
• Picking a different item separator
>>> print(1,2,3,sep=':')
1:2:3
>>> print("Hello","World",sep='')
HelloWorld
>>>
• Picking a different line ending
>>> print("What?",end="!?!n")
What?!?!
>>>
• Relatively minor, but these features are often
requested (e.g., "how do I get rid of the space?")
54. Discussion : New Idioms
• In Python 2, you might have code like this
print ",".join([name,shares,price])
• Which of these is better in Python 3?
print(",".join([name,shares,price]))
- or -
print(name, shares, price, sep=",")
• Overall, I think I like the second one (even
though it runs a little bit slower)
55. New String Formatting
• Python 3 has completely revised formatting
• Here is old Python (%)
s = "%10s %10d %10.2f" % (name, shares, price)
• Here is Python 3
s = "{0:10s} {1:10d} {2:10.2f}".format(name,shares,price)
• You might find the new formatting jarring
• Let's talk about it
56. First, Some History
• String formatting is one of the few features
of Python 2 that can't be customized
• Classes can define __str__() and __repr__()
• However, they can't customize % processing
• Python 2.6/3.0 adds a __format__() special
method that addresses this in conjunction
with some new string formatting machinery
57. String Conversions
• Objects now have three string conversions
>>> x = 1/3
>>> x.__str__()
'0.333333333333'
>>> x.__repr__()
'0.3333333333333333'
>>> x.__format__("0.2f")
'0.33'
>>> x.__format__("20.2f")
' 0.33'
>>>
• You will notice that __format__() takes a
code similar to those used by the % operator
58. format() function
• format(obj, fmt) calls __format__
>>> x = 1/3
>>> format(x,"0.2f")
'0.33'
>>> format(x,"20.2f")
' 0.33'
>>>
• This is analogous to str() and repr()
>>> str(x)
'0.333333333333'
>>> repr(x)
'0.3333333333333333'
>>>
59. Format Codes (Builtins)
• For builtins, there are standard format codes
Old Format New Format Description
"%d" "d" Decimal Integer
"%f" "f" Floating point
"%s" "s" String
"%e" "e" Scientific notation
"%x" "x" Hexadecimal
• Plus there are some brand new codes
"o" Octal
"b" Binary
"%" Percent
60. Format Examples
• Examples of simple formatting
>>> x = 42
>>> format(x,"x")
'2a'
>>> format(x,"b")
'101010'
>>> y = 2.71828
>>> format(y,"f")
'2.718280'
>>> format(y,"e")
'2.718280e+00'
>>> format(y,"%")
'271.828000%'
61. Format Modifiers
• Field width and precision modifiers
[width][.precision]code
• Examples:
>>> y = 2.71828
>>> format(y,"0.2f")
'2.72'
>>> format(y,"10.4f")
' 2.7183'
>>>
• This is exactly the same convention as with
the legacy % string formatting
62. Alignment Modifiers
• Alignment Modifiers
[<|>|^][width][.precision]code
< left align
> right align
^ center align
• Examples:
>>> y = 2.71828
>>> format(y,"<20.2f")
'2.72 '
>>> format(y,"^20.2f")
' 2.72 '
>>> format(y,">20.2f")
' 2.72'
>>>
63. Fill Character
• Fill Character
[fill][<|>|^][width][.precision]code
• Examples:
>>> x = 42
>>> format(x,"08d")
'00000042'
>>> format(x,"032b")
'00000000000000000000000000101010'
>>> format(x,"=^32d")
'===============42==============='
>>>
64. Thousands Separator
• Insert a ',' before the precision specifier
[fill][<|>|^][width][,][.precision]code
• Examples:
>>> x = 123456789
>>> format(x,",d")
'123,456,789'
>>> format(x,"10,.2f")
'123,456,789.00'
>>>
• This is pretty new (see PEP 378)
65. Discussion
• As you can see, there's a lot of flexibility in
the new format method (there are other
features not shown here)
• User-defined objects can also completely
customize their formatting if they implement
__format__(self,fmt)
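As a concrete illustration of that last point, here is a minimal sketch of a user-defined class hooking into the formatting machinery via `__format__()`. The `Stock` class and its `'brief'` format code are invented for this example, not part of any library:

```python
class Stock:
    def __init__(self, name, shares, price):
        self.name = name
        self.shares = shares
        self.price = price

    def __format__(self, fmt):
        # A made-up convention: 'brief' yields just the name;
        # any other code is applied to the price field.
        if fmt == 'brief':
            return self.name
        return format(self.price, fmt)

s = Stock('ACME', 50, 91.10)
print(format(s, 'brief'))       # ACME
print(format(s, '10.2f'))       # the price, right-aligned in 10 chars
print("{0:brief}".format(s))    # ACME -- .format() uses the same hook
```

Both the `format()` builtin and the string `.format()` method route the format code through this one method.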
66. String .format() Method
• Strings have .format() method for formatting
multiple values at once (replacement for %)
>>> "{0:10s} {1:10d} {2:10.2f}".format('ACME',50,91.10)
'ACME               50      91.10'
>>>
• format() method looks for formatting
specifiers enclosed in { } and expands them
• Each {} is similar to a %fmt specifier with the
old string formatting
67. Format Specifiers
• Each specifier has the form : {what:fmt}
• what. Indicates what is being formatted
(refers to one of the arguments supplied
to the format() method)
• fmt. A format code. The same as what is
supplied to the format() function
• Each {what:fmt} gets replaced by the result of
format(what,fmt)
68. Formatting Illustrated
• Arguments specified by position {n:fmt}
"{0:10s} {2:10.2f}".format('ACME',50,91.10)
• Arguments specified by keyword {key:fmt}
"{name:10s} {price:10.2f}".format(name='ACME',price=91.10)
• Arguments formatted in order {:fmt}
"{:10s} {:10d} {:10.2f}".format('ACME',50,91.10)
69. Container Lookups
• You can index sequences and dictionaries
>>> stock = ('ACME',50,91.10)
>>> "{s[0]:10s} {s[2]:10.2f}".format(s=stock)
'ACME            91.10'
>>> stock = {'name':'ACME', 'shares':50, 'price':91.10 }
>>> "{0[name]:10s} {0[price]:10.2f}".format(stock)
'ACME            91.10'
>>>
• Restriction :You can't put arbitrary expressions
in the [] lookup (has to be a number or simple
string identifier)
70. Attribute Access
• You can refer to instance attributes
class Stock(object):
def __init__(self,name,shares,price):
self.name = name
self.shares = shares
self.price = price
>>> s = Stock('ACME',50,91.10)
>>> "{0.name:10s} {0.price:10.2f}".format(s)
'ACME            91.10'
>>>
• Commentary : Nothing remotely like this with
the old string formatting operator
71. Nested Format Expansion
• .format() allows one level of nested lookups in
the format part of each {}
>>> s = ('ACME',50,91.10)
>>> "{0:{width}s} {2:{width}.2f}".format(*s,width=12)
'ACME                91.10'
>>>
• Probably best not to get too carried away in
the interest of code readability though
72. Other Formatting Details
• { and } must be escaped if part of formatting
• Use '{{' for '{'
• Use '}}' for '}'
• Example:
>>> "The value is {{{0}}}".format(42)
'The value is {42}'
>>>
73. Commentary
• The new string formatting is very powerful
• However, I'll freely admit that it still feels very
foreign to me (maybe it's due to my long
history with using printf-style formatting)
• Python 3 still has the % operator, but it may
go away some day (I honestly don't know).
• All things being equal, you probably want to
embrace the new formatting
74. Part 3
Binary Data Handling and Bytes
75. Bytes and Byte Arrays
• Python 3 has support for "byte-strings"
• Two new types : bytes and bytearray
• They are quite different than Python 2 strings
76. Defining Bytes
• Here's how to define byte "strings"
a = b"ACME 50 91.10" # Byte string literal
b = bytes([1,2,3,4,5]) # From a list of integers
c = bytes(10) # An array of 10 zero-bytes
d = bytes("Jalapeño","utf-8") # Encoded from string
• Can also create from a string of hex digits
e = bytes.fromhex("48656c6c6f")
• All of these define an object of type "bytes"
>>> type(a)
<class 'bytes'>
>>>
• However, this new bytes object is an odd duck
77. Bytes as Strings
• Bytes have standard "string" operations
>>> s = b"ACME 50 91.10"
>>> s.split()
[b'ACME', b'50', b'91.10']
>>> s.lower()
b'acme 50 91.10'
>>> s[5:7]
b'50'
• And bytes are immutable like strings
>>> s[0] = b'a'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment
78. Bytes as Integers
• Unlike Python 2, bytes are arrays of integers
>>> s = b"ACME 50 91.10"
>>> s[0]
65
>>> s[1]
67
>>>
• Same for iteration
>>> for c in s: print(c,end=' ')
65 67 77 69 32 53 48 32 57 49 46 49 48
>>>
• Hmmmm. Curious.
79. bytearray objects
• A bytearray is a mutable bytes object
>>> s = bytearray(b"ACME 50 91.10")
>>> s[:4] = b"PYTHON"
>>> s
bytearray(b"PYTHON 50 91.10")
>>> s[0] = 0x70 # Must assign integers
>>> s
bytearray(b'pYTHON 50 91.10')
>>>
• It also gives you various list operations
>>> s.append(23)
>>> s.append(45)
>>> s.extend([1,2,3,4])
>>> s
bytearray(b'ACME 50 91.10\x17-\x01\x02\x03\x04')
>>>
80. An Observation
• bytes and bytearray are not really meant to
mimic Python 2 string objects
• They're closer to array.array('B',...) objects
>>> import array
>>> s = array.array('B',[10,20,30,40,50])
>>> s[1]
20
>>> s[1] = 200
>>> s.append(100)
>>> s.extend([65,66,67])
>>> s
array('B', [10, 200, 30, 40, 50, 100, 65, 66, 67])
>>>
81. Bytes and Strings
• Bytes are not meant for text processing
• In fact, if you try to use them for text, you will
run into weird problems
• Python 3 strictly separates text (unicode) and
bytes everywhere
• This is probably the most major difference
between Python 2 and 3.
82. Mixing Bytes and Strings
• Mixed operations fail miserably
>>> s = b"ACME 50 91.10"
>>> 'ACME' in s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Type str doesn't support the buffer API
>>>
• Huh?!?? Buffer API?
• We'll cover that later...
83. Printing Bytes
• Printing and text-based I/O operations do not
work in a useful way with bytes
>>> s = b"ACME 50 91.10"
>>> print(s)
b'ACME 50 91.10'
>>>
Notice the leading b' and trailing
quote in the output.
• There's no way to fix this. print() should only
be used for outputting text (unicode)
84. Formatting Bytes
• Bytes do not support operations related to
formatted output (%, .format)
>>> s = b"%0.2f" % 3.14159
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'float'
>>>
• So, just forget about using bytes for any kind of
useful text output, printing, etc.
• No, seriously.
85. Commentary
• Why am I focusing on this "bytes as text" issue?
• If you are writing scripts that do simple ASCII
text processing, you might be inclined to use
bytes as a way to avoid the overhead of Unicode
• You might think that bytes are exactly the same
as the familiar Python 2 string object
• This is wrong. Bytes are not text. Using bytes as
text will lead to convoluted non-idiomatic code
86. How to Use Bytes
• To use the bytes objects, focus on problems
related to low-level I/O handling (message
passing, distributed computing, etc.)
• I will show some examples that illustrate
• A complaint: documentation (online and
books) is extremely thin on explaining
practical uses of bytes and bytearray objects
• Hope to rectify that a little bit here
87. Example : Reassembly
• In Python 2, you may know that string
concatenation leads to bad performance
msg = ""
while True:
chunk = s.recv(BUFSIZE)
if not chunk:
break
msg += chunk
• Here's the common workaround (hacky)
chunks = []
while True:
chunk = s.recv(BUFSIZE)
if not chunk:
break
chunks.append(chunk)
msg = b"".join(chunks)
88. Example : Reassembly
• Here's a new approach in Python 3
msg = bytearray()
while True:
chunk = s.recv(BUFSIZE)
if not chunk:
break
msg.extend(chunk)
• You treat the bytearray as a list and just
append/extend new data at the end as you go
• I like it. It's clean and intuitive.
89. Example: Reassembly
• The performance is good too
• Concat 1024 32-byte chunks together (10000x)
Concatenation : 18.49s
Joining : 1.55s
Extending a bytearray : 1.78s
• There are many parts of the Python standard
library that might benefit (e.g., BytesIO objects,
WSGI, multiprocessing, pickle, etc.)
90. Example: Record Packing
• Suppose you wanted to use the struct module
to incrementally pack a large binary message
objs = [ ... ] # List of tuples to pack
msg = bytearray() # Empty message
# First pack the number of objects
msg.extend(struct.pack("<I",len(objs)))
# Incrementally pack each object
for x in objs:
msg.extend(struct.pack(fmt, *x))
# Do something with the message
f.write(msg)
• I like this as well.
91. Comment : Writes
• The previous example is one way to avoid
making lots of small write operations
• Instead you collect data into one large message
that you output all at once.
• Improves I/O performance and code is nice
92. Example : Calculations
• Run a byte array through an XOR-cipher
>>> s = b"Hello World"
>>> t = bytes(x^42 for x in s)
>>> t
b'bOFFE\n}EXFN'
>>> bytes(x^42 for x in t)
b'Hello World'
>>>
• Compute and append a LRC checksum to a msg
# Compute the checksum and append at the end
chk = 0
for n in msg:
chk ^= n
msg.append(chk)
93. Commentary
• I'm excited about the new bytearray object
• Many potential uses in building low-level
infrastructure for networking, distributed
computing, messaging, embedded systems, etc.
• May make much of that code cleaner, faster, and
more memory efficient
• Still more features to come...
94. Part 4
System Interfaces
95. System Interfaces
• Major parts of the Python library are related to
low-level systems programming, sysadmin, etc.
• os, os.path, glob, subprocess, socket, etc.
• Unfortunately, there are some really sneaky
aspects of using these modules with Python 3
• It concerns the Unicode/Bytes separation
96. The Problem
• To carry out system operations, the Python
interpreter executes standard C system calls
• For example, POSIX calls on Unix
int fd = open(filename, O_RDONLY);
• However, names used in system interfaces (e.g.,
filenames, program names, etc.) are specified as
byte strings (char *)
• Bytes also used for environment variables and
command line options
97. Question
• How does Python 3 integrate strings (Unicode)
with byte-oriented system interfaces?
• Examples:
• Filenames
• Command line arguments (sys.argv)
• Environment variables (os.environ)
• Note:You should care about this if you use
Python for various system tasks
98. Name Encoding
• Standard practice is for Python 3 to UTF-8
encode all names passed to system calls
Python : f = open("somefile.txt","wt")
encode('utf-8')
C/syscall : open("somefile.txt",O_WRONLY)
• This is usually a safe bet
• ASCII is a subset and UTF-8 is an extension that
most operating systems support
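In Python 3.2 and later (i.e., slightly after the release these slides target), this filename translation is exposed directly through `os.fsencode()` and `os.fsdecode()`; a small sketch:

```python
import os

# os.fsencode()/os.fsdecode() (added in Python 3.2) apply the same
# filename encoding the interpreter uses at system-call boundaries;
# on most Unix systems this is UTF-8.
name = "somefile.txt"
raw = os.fsencode(name)          # the bytes the C open() would receive
print(raw)                       # b'somefile.txt'
assert os.fsdecode(raw) == name  # round-trips back to str
```

For pure-ASCII names the encoded bytes are identical to the text, which is why this usually "just works".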
100. Lurking Danger
• Be aware that some systems accept, but do not
strictly enforce UTF-8 encoding of names
• This is extremely subtle, but it means that names
used in system interfaces don't necessarily
match the encoding that Python 3 wants
• Will show a pathological example to illustrate
101. Example : A Bad Filename
• Start Python 2.6 on Linux and create a file using
the open() function like this:
>>> f = open("jalape\xf1o.txt","w")
>>> f.write("Bwahahahaha!\n")
>>> f.close()
• This creates a file with a single non-ASCII byte
(\xf1, 'ñ') embedded in the filename
• The filename is not UTF-8, but it still "works"
• Question: What happens if you try to do
something with that file in Python 3?
102. Example : A Bad Filename
• Python 3 won't be able to open the file
>>> f = open("jalape\xf1o.txt")
Traceback (most recent call last):
...
IOError: [Errno 2] No such file or directory: 'jalapeño.txt'
>>>
• This is caused by an encoding mismatch:
open("jalape\xf1o.txt") encodes the name as UTF-8 and asks the
operating system for b"jalape\xc3\xb1o.txt"
• It fails because the actual filename on disk is
b"jalape\xf1o.txt"
103. Example : A Bad Filename
• Bad filenames cause weird behavior elsewhere
• Directory listings
• Filename globbing
• Example : What happens if a non UTF-8 name
shows up in a directory listing?
• In early versions of Python 3, such names were
silently discarded (made invisible). Yikes!
104. Names as Bytes
• You can specify filenames using byte strings
instead of strings as a workaround
>>> f = open(b"jalape\xf1o.txt")
>>>
Notice bytes
>>> files = glob.glob(b"*.txt")
>>> files
[b'jalape\xf1o.txt', b'spam.txt']
>>>
• This turns off the UTF-8 encoding and returns
all results as bytes
• Note: Not obvious and a little hacky
105. Surrogate Encoding
• In Python 3.1, non-decodable (bad) characters in
filenames and other system interfaces are
translated using "surrogate encoding" as
described in PEP 383.
• This is a Python-specific "trick" for getting
characters that don't decode as UTF-8 to pass
through system calls in a way where they still
work correctly
106. Surrogate Encoding
• Idea : Any non-decodable bytes in the range
0x80-0xff are translated to Unicode characters
U+DC80-U+DCFF
• Example: b"jalape\xf1o.txt" --(surrogate encoding)-->
"jalape\udcf1o.txt"
• Similarly, Unicode characters U+DC80-U+DCFF
are translated back into bytes 0x80-0xff when
presented to system interfaces
107. Surrogate Encoding
• You will see this used in various library functions
and it works for functions like open()
• Example:
>>> glob.glob("*.txt")
['jalape\udcf1o.txt', 'spam.txt']
notice the odd unicode character
>>> f = open("jalape\udcf1o.txt")
>>>
• If you ever see a \udcxx character, it means that
a non-decodable byte was passed in from a
system interface
108. Surrogate Encoding
• Question : Does this break part of Unicode?
• Answer : Unsure
• This uses a range of Unicode dedicated for a
feature known as "surrogate pairs". A pair of
Unicode characters encoded like this
(U+D800-U+DBFF, U+DC00-U+DFFF)
• In Unicode, you would never see a U+DCxx
character appearing all on its own
109. Caution : Printing
• Non-decodable bytes will break print()
>>> files = glob.glob("*.txt")
>>> files
['jalape\udcf1o.txt', 'spam.txt']
>>> for name in files:
... print(name)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character
'\udcf1' in position 6: surrogates not allowed
>>>
• Arg! If you're using Python for file manipulation
or system administration you need to be careful
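One defensive option (my suggestion, not from the slides) is to re-encode with the `'backslashreplace'` error handler, so undecodable filenames print as escape sequences instead of raising `UnicodeEncodeError`:

```python
def safe_name(name):
    # Lone surrogates can't be encoded as UTF-8, but the
    # 'backslashreplace' handler substitutes \uXXXX escapes,
    # giving printable ASCII output for any filename.
    return name.encode('utf-8', 'backslashreplace').decode('ascii')

# A filename containing a surrogate-escaped byte
print(safe_name('jalape\udcf1o.txt'))   # jalape\udcf1o.txt
```

The printed name is no longer usable for reopening the file, but it lets directory listings survive bad filenames.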
110. Implementation
• Surrogate encoding is implemented as an error
handler for encode() and decode()
• Example:
>>> s = b"jalape\xf1o.txt"
>>> t = s.decode('utf-8','surrogateescape')
>>> t
'jalape\udcf1o.txt'
>>> t.encode('utf-8','surrogateescape')
b'jalape\xf1o.txt'
>>>
• If you are porting code that deals with system
interfaces, you might need to do this
111. Commentary
• This handling of Unicode in system interfaces is
also of interest to C/C++ extensions
• What happens if a C/C++ function returns an
improperly encoded byte string?
• What happens in ctypes? Swig?
• Seems unexplored (too obscure? new?)
112. Part 5
The io module
113. I/O Implementation
• I/O in Python 2 is largely based on C I/O
• For example, the "file" object is just a thin layer
over a C "FILE *" object
• Python 3 changes this
• In fact, Python 3 has a complete ground-up
reimplementation of the whole I/O system
114. The open() function
• For files, you still use open() as you did before
• However, the result of calling open() varies
depending on the file mode and buffering
• Carefully study the output of this:
>>> open("foo.txt","rt")
<_io.TextIOWrapper name='foo.txt' encoding='UTF-8'>
>>> open("foo.txt","rb")
<_io.BufferedReader name='foo.txt'>
>>> open("foo.txt","rb",buffering=0)
<_io.FileIO name='foo.txt' mode='rb'>
>>>
• Notice how you get a different kind of result each time
115. The io module
• The core of the I/O system is implemented in
the io library module
• It consists of a collection of different I/O classes
FileIO
BufferedReader
BufferedWriter
BufferedRWPair
BufferedRandom
TextIOWrapper
BytesIO
StringIO
• Each class implements a different kind of I/O
• The classes get layered to add features
116. Layering Illustrated
• Here's the result of opening a "text" file
open("foo.txt","rt")
TextIOWrapper
BufferedReader
FileIO
• Keep in mind: This is very different from Python 2
• Inspired by Java? (don't know, maybe)
117. FileIO Objects
• An object representing raw unbuffered binary I/O
• FileIO(name [, mode [, closefd]])
name : Filename or integer fd
mode : File mode ('r', 'w', 'a', 'r+',etc.)
closefd : Flag that controls whether close() called
• Under the covers, a FileIO object is directly
layered on top of operating system functions
such as read(), write()
118. FileIO Usage
• FileIO replaces os module functions
• Example : Python 2 (os module)
fd = os.open("somefile",os.O_RDONLY)
data = os.read(fd,4096)
os.lseek(fd,16384,os.SEEK_SET)
...
• Example : Python 3 (FileIO object)
f = io.FileIO("somefile","r")
data = f.read(4096)
f.seek(16384,os.SEEK_SET)
...
• It's a low-level file with a file-like interface (nice)
119. Direct System I/O
• FileIO directly exposes the behavior of low-level
system calls on file descriptors
• This includes:
• Partial read/writes
• Returning system error codes
• Blocking/nonblocking I/O handling
• System hackers want this
120. Direct System I/O
• File operations (read/write) execute a single
system call no matter what
data = f.read(8192) # Executes one read syscall
f.write(data) # Executes one write syscall
• This might mean partial data (you must check)
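A sketch of what "you must check" looks like in practice: a helper that loops until every byte is out. This is my illustration, not slide code; note that in non-blocking mode a raw `write()` can also return `None`, which this simple version doesn't handle:

```python
import io

def write_all(f, data):
    # A raw (unbuffered) write() may accept only part of the data
    # and returns how many bytes it actually took; keep slicing a
    # memoryview forward until everything has been handed to the OS.
    view = memoryview(data)
    while len(view):
        n = f.write(view)
        view = view[n:]

# Demo with an in-memory stand-in for a raw file object
buf = io.BytesIO()
write_all(buf, b"ACME 50 91.10")
assert buf.getvalue() == b"ACME 50 91.10"
```

The buffered writer classes on the next slide do exactly this kind of bookkeeping for you.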
121. Commentary
• FileIO is the most critical object in the I/O stack
• Everything else depends on it
• Nothing quite like it in Python 2
122. BufferedIO Objects
• The following classes implement buffered I/O
BufferedReader(f [, buffer_size])
BufferedWriter(f [, buffer_size [, max_buffer_size]])
BufferedRWPair(f_read, f_write
[, buffer_size [, max_buffer_size]])
BufferedRandom(f [, buffer_size [, max_buffer_size]])
• Each of these classes is layered over a supplied
raw FileIO object (f)
f = io.FileIO("foo.txt") # Open the file (raw I/O)
g = io.BufferedReader(f) # Put buffering around it
f = io.BufferedReader(io.FileIO("foo.txt")) # Alternative
123. Buffered Operations
• Buffered readers implement these methods
f.peek([n]) # Return up to n bytes of data without
# advancing the file pointer
f.read([n]) # Return n bytes of data as bytes
f.read1([n]) # Read up to n bytes using a single
# read() system call
• Buffered writers implement these methods
f.write(bytes) # Write bytes
f.flush() # Flush output buffers
• Other ops (seek, tell, close, etc.) work as well
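A quick sketch of the `peek()`/`read()` distinction, using an in-memory stream in place of a real file:

```python
import io

f = io.BufferedReader(io.BytesIO(b"ACME 50 91.10"))

head = f.peek(4)     # may return MORE than 4 bytes (whatever is
                     # buffered) and does NOT advance the position
assert head[:4] == b"ACME"

assert f.read(4) == b"ACME"      # read() does advance
assert f.read() == b" 50 91.10"  # the rest of the stream
```

`peek()` is handy for sniffing headers or magic numbers before deciding how to consume the rest of the data.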
124. TextIOWrapper
• The object that implements text-based I/O
TextIOWrapper(buffered [, encoding [, errors
[, newline [, line_buffering]]]])
buffered - A buffered file object
encoding - Text encoding (e.g., 'utf-8')
errors - Error handling policy (e.g. 'strict')
newline - '', '\n', '\r', '\r\n', or None
line_buffering - Flush output after each line (False)
• It is layered on a buffered I/O stream
f = io.FileIO("foo.txt") # Open the file (raw I/O)
g = io.BufferedReader(f) # Put buffering around it
h = io.TextIOWrapper(g,"utf-8") # Text I/O wrapper
125. TextIOWrapper and codecs
• Python 2 used the codecs module for unicode
• TextIOWrapper is a completely new object,
written almost entirely in C
• It kills codecs.open() in performance
for line in open("biglog.txt",encoding="utf-8"):   # 3.8 sec
    pass
f = codecs.open("biglog.txt",encoding="utf-8")     # 53.3 sec
for line in f:
    pass
Note: both tests performed using Python-3.1.1
126. Putting it All Together
• As a user, you don't have to worry too much
about how the different parts of the I/O system
are put together (all of the different classes)
• The built-in open() function constructs the
proper set of IO objects depending on the
supplied parameters
• Power users might use the io module directly
for more precise control over special cases
127. open() Revisited
• Here is the full prototype
open(name [, mode [, buffering [, encoding [, errors
[, newline [, closefd]]]]]])
• The different parameters get passed to
underlying objects that get created
name, mode, closefd       --> FileIO
buffering                 --> BufferedReader, BufferedWriter
encoding, errors, newline --> TextIOWrapper
128. open() Revisited
• The type of IO object returned depends on the
supplied mode and buffering parameters
mode buffering Result
any binary 0 FileIO
"rb" != 0 BufferedReader
"wb","ab" != 0 BufferedWriter
"rb+","wb+","ab+" != 0 BufferedRandom
any text != 0 TextIOWrapper
• Note: Certain combinations are illegal and will
produce an exception (e.g., unbuffered text)
129. Unwinding the I/O Stack
• Sometimes you might need to unwind a file
open("foo.txt","rt")
TextIOWrapper
.buffer
BufferedReader
.raw
FileIO
• Scenario :You were given an open text-mode
file, but want to use it in binary mode
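For that scenario, a small self-contained sketch (it creates a throwaway `demo.txt`, a name chosen only for this example) showing the `.buffer` and `.raw` attributes in action:

```python
# Create a small file so the example is self-contained
with open("demo.txt", "wt") as out:
    out.write("hello\n")

f = open("demo.txt", "rt")     # TextIOWrapper on top
b = f.buffer                   # the BufferedReader underneath
print(type(f).__name__)        # TextIOWrapper
print(type(b).__name__)        # BufferedReader
print(type(b.raw).__name__)    # FileIO at the bottom

data = b.read()                # read raw bytes, bypassing decoding
assert data == b"hello\n"
f.close()
```

Bypassing the text layer this way means you get `bytes` back with no Unicode decoding applied.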
130. I/O Performance
• Question : How does new I/O perform?
• Will compare:
• Python 2.6.4 built-in open()
• Python 3.1.1 built-in open()
• Note: This is not exactly a fair test--the Python 3
open() has to decode Unicode text
• However, it's realistic, because most programmers
use open() without thinking about it
131. I/O Performance
• Read a 100 Mbyte text file all at once
data = open("big.txt").read()
Python 2.6.4              : 0.16s
Python 3.1 (UCS-2, UTF-8) : 0.95s   (overhead due to text decoding)
Python 3.1 (UCS-4, UTF-8) : 1.67s
• Read a 100 Mbyte binary file all at once
data = open("big.bin","rb").read()
Python 2.6.4              : 0.16s
Python 3.1 (UCS-2, UTF-8) : 0.16s   (no noticeable difference)
Python 3.1 (UCS-4, UTF-8) : 0.16s
• Note: tests conducted with warm disk cache
132. I/O Performance
• Write a 100 Mbyte text file all at once
open("foo.txt","wt").write(text)
Python 2.6.4              : 2.30s
Python 3.1 (UCS-2, UTF-8) : 2.47s
Python 3.1 (UCS-4, UTF-8) : 2.55s
• Write a 100 Mbyte binary file all at once
open("big.bin","wb").write(data)
Python 2.6.4              : 2.16s
Python 3.1 (UCS-2, UTF-8) : 2.16s   (no noticeable difference)
Python 3.1 (UCS-4, UTF-8) : 2.16s
• Note: tests conducted with warm disk cache
133. I/O Performance
• Iterate over 730000 lines of a big log file (text)
for line in open("biglog.txt"):
pass
Python 2.6.4 : 0.24s
Python 3.1 (UCS-2, UTF-8) : 0.57s
Python 3.1 (UCS-4, UTF-8) : 0.82s
• Iterate over 730000 lines of a log file (binary)
for line in open("biglog.txt","rb"):
pass
Python 2.6.4 : 0.24s
Python 3.1 (UCS-2, UTF-8) : 0.29s
Python 3.1 (UCS-4, UTF-8) : 0.29s
134. I/O Performance
• Write 730000 lines of log data (text)
open("biglog.txt","wt").writelines(lines)
Python 2.6.4              : 1.3s
Python 3.1 (UCS-2, UTF-8) : 1.4s
Python 3.1 (UCS-4, UTF-8) : 1.4s
Note: higher variance in observed times; these are
10-sample averages (rough ballpark)
• Write 730000 lines of log data (binary)
open("biglog.txt","wb").writelines(lines)
Python 2.6.4              : 1.3s
Python 3.1 (UCS-2, UTF-8) : 1.3s
Python 3.1 (UCS-4, UTF-8) : 1.3s
135. Commentary
• For binary, the Python 3 I/O system is
comparable to Python 2 in performance
• Text based I/O has an unavoidable penalty
• Extra decoding (UTF-8)
• An extra memory copy
• You might be able to minimize the decoding
penalty by specifying 'latin-1' (fastest)
• The memory copy can't be eliminated
136. Commentary
• Reading/writing always involves bytes
"Hello World" -> 48 65 6c 6c 6f 20 57 6f 72 6c 64
• To get it to Unicode, it has to be copied to
multibyte integers (no workaround)
48 65 6c 6c 6f 20 57 6f 72 6c 64
Unicode conversion
0048 0065 006c 006c 006f 0020 0057 006f 0072 006c 0064
• The only way to avoid this is to never convert
bytes into a text string (not always practical)
137. Advice
• Heed the advice of the optimization gods---ask
yourself if it's really worth worrying about
(premature optimization as the root of all evil)
• No seriously... does it matter for your app?
• If you are processing huge (no, gigantic) amounts
of 8-bit text (ASCII, Latin-1, UTF-8, etc.) and I/O
has been determined to be the bottleneck, there
is one approach to optimization that might work
138. Text Optimization
• Perform all I/O in binary/bytes and defer
Unicode conversion to the last moment
• If you're filtering or discarding huge parts of the
text, you might get a big win
• Example : Log file parsing
139. Example
• Find all URLs that 404 in an Apache log
140.180.132.213 - - [...] "GET /ply/ply.html HTTP/1.1" 200 97238
140.180.132.213 - - [...] "GET /favicon.ico HTTP/1.1" 404 133
• Processing everything as text
error_404_urls = set()
for line in open("biglog.txt"):
fields = line.split()
if fields[-2] == '404':
error_404_urls.add(fields[-4])
for name in error_404_urls:
print(name)
Python 2.6.4 : 1.21s
Python 3.1 (UCS-2) : 2.12s
Python 3.1 (UCS-4) : 2.56s
140. Example Optimization
• Deferred text conversion
error_404_urls = set()
for line in open("biglog.txt","rb"):
fields = line.split()
if fields[-2] == b'404':
error_404_urls.add(fields[-4])
for name in error_404_urls:
print(name.decode('latin-1'))
Unicode conversion here
Python 2.6.4 : 1.21s
Python 3.1 (UCS-2) : 1.21s
Python 3.1 (UCS-4) : 1.26s
141. Part 6
Standard Library Issues
142. Text, Bytes, and the Library
• In Python 2, you could be sloppy about the
distinction between text and bytes in many
library functions
• Networking modules
• Data handling modules
• Various sorts of conversions
• In Python 3, you must be very precise
143. Example : Socket Sends
• Here's a skeleton of some sloppy Python 2 code
def send_response(s,code,msg):
s.sendall("HTTP/1.0 %s %s\r\n" % (code,msg))
send_response(s,"200","OK")
• This is almost guaranteed to break
• Reason : Almost every library function that
communicates with the outside world (sockets,
urllib, SocketServer, etc.) now uses binary I/O
• So, text operations are going to fail
144. Example : Socket Sends
• In Python 3, you must explicitly encode text
def send_response(s,code,msg):
resp = "HTTP/1.0 {:s} {:s}\r\n".format(code,msg)
s.sendall(resp.encode('ascii'))
send_response(s,"200","OK")
• Commentary :You really should have been doing
this in Python 2 all along