smart_open at Data Science London meetup
1. smart_open
Streaming large files with a simple Pythonic API to and from
S3, HDFS, WebHDFS, even zip and local files
Lev Konstantinovskiy
2. What?
smart_open is
a Python 2 and 3 library
for efficient streaming of very large files
with a simple Pythonic API
in 600 lines of code.
3. Easily switch just the path when data are moved,
for example from laptop to S3.
smart_open.smart_open('./foo.txt')
smart_open.smart_open('./foo.txt.gz')
smart_open.smart_open('s3://mybucket/mykey.txt')
smart_open.smart_open('hdfs://user/hadoop/my_file.txt')
5. Why?
- Originally part of gensim, an out-of-core
open-source text processing library (word2vec,
LDA, etc.). smart_open is used for streaming
large text corpora.
6. Why?
Boto is not Pythonistic :(
- You have to study 15 pages of the boto book
before using S3.
Solution:
smart_open is Pythonised boto
7. What is “Pythonistic”?
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
8. Write more than 5GB to S3: multipart-ing in Boto
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> chunk_size = 52428800  # use a chunk size of 50 MiB
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
# Send the file parts, using FileChunkIO to create a file-like object
# that points to a certain byte range within the original file. We
# set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
# Finish the upload
>>> mp.complete_upload()
# Note that if you forget to call either mp.complete_upload() or
# mp.cancel_upload() you will be left with an incomplete upload and
# charged for the storage consumed by the uploaded parts. A call to
# bucket.get_all_multipart_uploads() can help to show lost multipart
# upload parts.
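The part arithmetic above is easy to get subtly wrong, so here is a stdlib-only sketch of the same part-count and last-part-size calculation. The 5 GiB file size is a made-up example, not from the talk:

```python
import math

chunk_size = 52428800        # 50 MiB, as in the slide
source_size = 5 * 1024 ** 3  # hypothetical 5 GiB file

# Number of parts boto would upload
chunk_count = int(math.ceil(source_size / float(chunk_size)))

# The final part is usually smaller than chunk_size; min() keeps
# bytes from ever exceeding the original file size
last_offset = chunk_size * (chunk_count - 1)
last_part = min(chunk_size, source_size - last_offset)

print(chunk_count)  # 103 parts for a 5 GiB file
print(last_part)    # 20971520 bytes (20 MiB) in the final part
```

This is exactly the edge case the `min(chunk_size, source_size - offset)` line in the boto snippet handles: 102 full 50 MiB parts plus one 20 MiB remainder.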
9. Write more than 5GB to S3: multipart-ing in smart_open
>>> # stream content *into* S3 (write mode, multipart-ing behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
10. Write more than 5GB to S3: multipart-ing
Boto:
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> chunk_size = 52428800  # use a chunk size of 50 MiB (feel free to change this)
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
# Send the file parts, using FileChunkIO to create a file-like object
# that points to a certain byte range within the original file. We
# set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
# Finish the upload
>>> mp.complete_upload()
# Note that if you forget to call either mp.complete_upload() or
# mp.cancel_upload() you will be left with an incomplete upload
smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
11. From S3 to memory
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # Create StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
MemoryError
>>> # Workaround for memory error: write to
>>> # local disk first. Needs a large local disk.
smart_open:
>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
>>> # bonus
... fin.seek(0)  # seek to the beginning
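The contrast on this slide boils down to lazy iteration versus one big read(). A stdlib-only sketch of the same difference on a local file (the file name and contents are made up; imagine the file being many GiB on S3):

```python
import os
import tempfile

# Write a small stand-in file
path = os.path.join(tempfile.mkdtemp(), 'corpus.txt')
with open(path, 'w') as f:
    for i in range(1000):
        f.write('line %d\n' % i)

# boto-style: the whole payload materialises in RAM at once;
# for a huge file this is where MemoryError comes from
with open(path) as f:
    whole = f.read()

# smart_open-style: one line in RAM at a time, constant memory
count = 0
with open(path) as fin:
    for line in fin:  # lazy iteration
        count += 1

print(count)  # 1000
```

smart_open gives the S3 object this same lazy, file-like interface, so the streaming loop looks identical whether the data is local or remote.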
12. From large iterator to S3
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(list(my_iterator))
Traceback (most recent call last):
MemoryError
smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
# Streamed input is uploaded in chunks, as soon as
# `min_part_size` bytes are accumulated
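The `min_part_size` behaviour can be sketched with a toy writer that accumulates bytes and flushes a "part" whenever the buffer reaches the threshold. The class below is a hypothetical illustration of the buffering idea, not smart_open's actual implementation:

```python
class ChunkedWriter(object):
    """Toy sketch of min_part_size buffering (not smart_open's code)."""

    def __init__(self, min_part_size):
        self.min_part_size = min_part_size
        self.buffer = []
        self.buffered = 0
        self.parts = []  # stands in for uploaded S3 parts

    def write(self, data):
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.min_part_size:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.parts.append(''.join(self.buffer))  # "upload" one part
            self.buffer, self.buffered = [], 0

    def close(self):
        self._flush()  # whatever is left becomes the last part


w = ChunkedWriter(min_part_size=10)
for line in ['first line\n', 'second line\n', 'third line\n']:
    w.write(line)
w.close()
print(len(w.parts))  # each line exceeds 10 bytes, so 3 parts
```

With a realistic threshold (S3 requires at least 5 MiB per part except the last), many writes would coalesce into one part; the caller never holds the full stream in memory.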
13. Un/Zipping line by line
>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line
>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
14. Summary of Why?
Working with large S3 files using Amazon's default Python library, boto, is
a pain.
- Limited by RAM. Its key.set_contents_from_string() and
key.get_contents_as_string() methods only work for small files
(loaded into RAM, no streaming).
- There are nasty hidden gotchas when using boto's multipart upload
functionality, and a lot of boilerplate.
- gzip has no context manager support in Python 2.5 and 2.6.
smart_open shields you from all that.
It builds on boto but offers a cleaner API.
The result is less code for you to write and fewer bugs to make.
15. Streaming out-of-core read and write for:
- S3
- HDFS
- WebHDFS (don't have to use the requests library!)
- local files
- local compressed files
smart_open is not just for S3!