smart_open
Streaming large files with a simple Pythonic API to and from
S3, HDFS, WebHDFS, even zip and local files
Lev Konstantinovskiy
What?
smart_open is
a Python 2 and 3 library
for efficient streaming of very large files
with a simple Pythonic API
in 600 lines of code.
Easily switch just the path when data move, for example from laptop to S3.
smart_open.smart_open('./foo.txt')
smart_open.smart_open('./foo.txt.gz')
smart_open.smart_open('s3://mybucket/mykey.txt')
smart_open.smart_open('hdfs://user/hadoop/my_file.txt')
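The idea behind this single entry point can be sketched with the standard library alone. This toy dispatcher (hypothetical, not smart_open's actual code) picks an opener from the path; the real library also dispatches on URI schemes such as s3:// and hdfs://:

```python
import bz2
import gzip
import io


def toy_smart_open(path, mode='rb'):
    """Toy dispatcher: pick an opener from the path alone.

    Handles only local (optionally compressed) files; the real
    smart_open also dispatches on URI schemes like s3:// and hdfs://.
    """
    if path.endswith('.gz'):
        return gzip.open(path, mode)
    if path.endswith('.bz2'):
        return bz2.BZ2File(path, mode)
    return io.open(path, mode)
```

Calling code stays identical whether the file is plain or compressed, which is exactly the property the slide advertises.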
Who?
Open-source MIT License. Maintained by RaRe
Technologies. Headed by Radim Rehurek aka
piskvorky.
Why?
- Originally part of gensim - an out-of-core
open-source text processing library (word2vec,
LDA etc). smart_open is used for streaming
large text corpora.
Why?
Boto is not Pythonistic :(
- Study 15 pages of the boto book
just to use S3
Solution:
smart_open is Pythonised boto
What is “Pythonistic”?
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
Write more than 5GB to S3: multipart-ing in Boto
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
>>> chunk_size = 52428800  # use a chunk size of 50 MiB
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
# Send the file parts, using FileChunkIO to create a file-like object
# that points to a certain byte range within the original file. We
# set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
# Finish the upload
>>> mp.complete_upload()
# Note that if you forget to call either mp.complete_upload() or
# mp.cancel_upload() you will be left with an incomplete upload and
# charged for the storage consumed by the uploaded parts. A call to
# bucket.get_all_multipart_uploads() can help to show lost multipart
# upload parts.
Write more than 5GB to S3: multipart-ing in smart_open
>>> # stream content *into* S3 (write mode, multiparting behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
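The trick behind that one-liner can be sketched as a writer that buffers bytes and ships a numbered "part" whenever a size threshold is crossed. This is a simplified illustration, not smart_open's real code; the `upload` callable stands in for the actual S3 part upload:

```python
class BufferedPartWriter:
    """Toy illustration of multipart buffering: accumulate written
    bytes and flush a numbered 'part' once min_part_size is reached.
    `upload` stands in for the real S3 part-upload call."""

    def __init__(self, upload, min_part_size=50 * 1024 * 1024):
        self.upload = upload          # callable(part_number, data)
        self.min_part_size = min_part_size
        self.buf = bytearray()
        self.part_number = 0

    def write(self, data):
        self.buf.extend(data)
        if len(self.buf) >= self.min_part_size:
            self._flush()

    def _flush(self):
        if self.buf:
            self.part_number += 1
            self.upload(self.part_number, bytes(self.buf))
            self.buf = bytearray()

    def close(self):
        # flush the final (possibly short) part, analogous to
        # complete_upload() in boto
        self._flush()
```

The caller only ever sees `write()` and `close()`; all the part bookkeeping that boto exposes directly is hidden inside.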
Write more than 5GB to S3: multipart-ing
Boto:
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
# Use a chunk size of 50 MiB (feel free to change this)
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
# Send the file parts, using FileChunkIO to create a file-like object
# that points to a certain byte range within the original file. We
# set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
...     offset = chunk_size * i
...     bytes = min(chunk_size, source_size - offset)
...     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
...         mp.upload_part_from_file(fp, part_num=i + 1)
# Finish the upload
>>> mp.complete_upload()
# Note that if you forget to call either mp.complete_upload() or
# mp.cancel_upload() you will be left with an incomplete upload
smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
From S3 to memory
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # Create StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
MemoryError
>>> # Workaround for the memory error: write to local disk first.
>>> # Needs a large local disk.
smart_open:
>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
>>> # bonus
>>> fin.seek(0)  # seek to the beginning
From large iterator to S3
Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(list(my_iterator))
Traceback (most recent call last):
MemoryError
smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
# Streamed input is uploaded in chunks, as soon as `min_part_size`
# bytes are accumulated
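The fix is simply to consume the iterator lazily instead of `list()`-ing it. The same pattern works with an ordinary local file as the sink (a small sketch; the S3 case only swaps the destination):

```python
def stream_to_file(lines, path):
    """Consume an iterator of text lines lazily, writing each line to
    `path` as it arrives; the whole iterable is never materialised
    in memory, so arbitrarily large iterators are fine."""
    n = 0
    with open(path, 'w') as fout:
        for line in lines:
            fout.write(line + '\n')
            n += 1
    return n
```

Passing a generator here keeps memory usage constant no matter how many lines it yields, which is exactly what `list(my_iterator)` throws away.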
Un/Zipping line by line
>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line
>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
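The same line-by-line streaming over compressed data can be shown with the standard library alone. A small sketch, no smart_open required: gzip decompresses incrementally, so only one line is in RAM at a time.

```python
import gzip


def count_lines_gz(path):
    """Stream a gzipped text file line by line; only one decompressed
    line lives in RAM at a time, so file size does not matter."""
    n = 0
    with gzip.open(path, 'rt') as fin:
        for _line in fin:
            n += 1
    return n
```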
Summary of Why?
Working with large S3 files using Amazon's default Python library, boto, is a pain.
- Limited by RAM: its key.set_contents_from_string() and
key.get_contents_as_string() methods only work for small files
(loaded into RAM, no streaming).
- There are nasty hidden gotchas when using boto's multipart upload
functionality, and a lot of boilerplate.
smart_open shields you from that.
It builds on boto but offers a cleaner API.
The result is less code for you to write and fewer bugs to make.
- Bonus: gzip files work as context managers even on Python 2.5 and 2.6,
where the standard library's GzipFile does not support the `with` statement.
Streaming out-of-core read and write for:
- S3
- HDFS
- WebHDFS (no need to use the requests library directly!)
- local files
- local compressed files
smart_open is not just for S3!
Thanks!
Lev Konstantinovskiy
github.com/tmylk
@teagermylk
Presented at the Data Science London meetup.

  • 1. smart_open Streaming large files with a simple Pythonic API to and from S3, HDFS, WebHDFS, even zip and local files Lev Konstantinovskiy
  • 2. What? smart_open is a Python 2 and 3 library for efficient streaming of very large files with a simple Pythonic API in 600 lines of code.
  • 3. Easily switch just the path when data are moved for example from laptop to S3. smart_open.smart_open('./foo.txt') smart_open.smart_open('./foo.txt.gz') smart_open.smart_open('s3://mybucket/mykey.txt') smart_open.smart_open('hdfs://user/hadoop/my_file.txt
  • 4. Who? Open-source MIT License. Maintained by RaRe Technologies. Headed by Radim Rehurek aka piskvorky.
  • 5. Why? - Originally part of gensim - an out-of-core open-source text processing library (word2vec, LDA etc). smart_open is used for streaming large text corpora.
  • 6. Why? Boto is not Pythonistic :( - Study 15 pages of boto book before using S3 Solution: smart_open is Pythonised boto
  • 7. What is “Pythonistic”? Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. PEP 20. The zen of Python
  • 8. Write more than 5GB to S3: multipart-ing in Boto >>> mp = b.initiate_multipart_upload(os.path.basename(source_path)) >>> chunk_size = 52428800; chunk_count = int(math.ceil(source_size / float(chunk_size))) # Use a chunk size of 50 MiB # Send the file parts, using FileChunkIO to create a file-like object # that points to a certain byte range within the original file. We # set bytes to never exceed the original file size. >>> for i in range(chunk_count): >>> offset = chunk_size * i >>> bytes = min(chunk_size, source_size - offset) >>> with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp: >>> mp.upload_part_from_file(fp, part_num=i + 1) # Finish the upload >>> mp.complete_upload() #Note that if you forget to call either mp.complete_upload() or mp.cancel_upload() you will be left with an incomplete upload and charged for the storage consumed by the uploaded parts. A call to bucket.get_all_multipart_upl oads() can help to show lost multipart upload parts.
  • 9. Write more than 5GB to S3: multipart-ing in smart_open >>> # stream content *into* S3 (write mode, multiparting behind the screen): >>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout: ... for line in ['first line', 'second line', 'third line']: ... fout.write(line + 'n')
  • 10. Write more than 5GB to S3: multipart-ing >>> mp = b.initiate_multipart_upload(os.path.basename(source_path)) # Use a chunk size of 50 MiB (feel free to change this) >>> chunk_size = 52428800 >>> chunk_count = int(math.ceil(source_size / float(chunk_size))) # Send the file parts, using FileChunkIO to create a file-like object # that points to a certain byte range within the original file. We # set bytes to never exceed the original file size. >>> for i in range(chunk_count): >>> offset = chunk_size * i >>> bytes = min(chunk_size, source_size - offset) >>> with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp: >>> mp.upload_part_from_file(fp, part_num=i + 1) # Finish the upload >>> mp.complete_upload() #Note that if you forget to call either mp.complete_upload() or mp.cancel_upload() you will Boto: >>> # stream content *into* S3 (write mode): >>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout: ... for line in ['first line', 'second line', 'third line']: ... fout.write(line + 'n') smart_open: Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. PEP 20. The zen of Python
  • 11. From S3 to memory >>> c = boto.connect_s3() >>> b = c.get_bucket('mybucket') >>> k = Key(b) >>> k.key = 'foobar' >>> # Create StringIO in RAM >>> k.get_contents_as_string() Traceback (most recent call last): MemoryError >>> # Workaround for memory error: writing to local disk first. Need a large local Boto: >>> # can use context managers: >>> with smart_open.smart_open('s3://mybucket/myke y.txt') as fin: ... for line in fin: ... print line >>> # bonus ... fin.seek(0) # seek to the beginning smart_open:
  • 12. From large iterator to S3 >>> c = boto.connect_s3() >>> b = c.get_bucket('mybucket') >>> k = Key(b) >>> k.key = 'foobar' >>> k.set_contents_as_string( list(my_iterator)) Traceback (most recent call last): MemoryError Boto: >>> # stream content *into* S3 (write mode): >>> with smart_open.smart_open('s3://mybucket/myke y.txt', 'wb') as fout: ... for line in ['first line', 'second line', 'third line']: ... fout.write(line + 'n') # Streamed input is uploaded in chunks, as soon as `min_part_size` bytes are accumulated smart_open:
  • 13. Un/Zipping line by line >>> # stream from/to local compressed files: >>> for line in smart_open.smart_open('./foo.txt.gz'): ... print line >>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout: ... fout.write("some contentn")
  • 14. Summary of Why? Working with large S3 files using Amazon's default Python library, boto, is a pain. - limited by RAM. Its key.set_contents_from_string() and key.get_contents_as_string() methods only work for small files (loaded in RAM, no streaming). - There are nasty hidden gotchas when using boto's multipart upload functionality, and a lot of boilerplate. smart_open shields you from that. It builds on boto but offers a cleaner API. The result is less code for you to write and fewer bugs to make. - gzip ContextManager in Python 2.5 and 2.6
  • 15. Streaming out-of-core read and write for: - S3 - HDFS - WebHDFS ( don’t have to use requests library!) - local files. - local compressed files smart_open is not just for S3!