smart_open
Streaming large files with a simple Pythonic API to and from
S3, HDFS, WebHDFS, even zip and local files
Lev Konstantinovskiy
What?
smart_open is
a Python 2 and 3 library
for efficient streaming of very large files
with a simple Pythonic API
in 600 lines of code.
Easily switch just the path when data move, for example from laptop to S3:
smart_open.smart_open('./foo.txt')
smart_open.smart_open('./foo.txt.gz')
smart_open.smart_open('s3://mybucket/mykey.txt')
smart_open.smart_open('hdfs://user/hadoop/my_file.txt')
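Whatever the scheme, smart_open.smart_open() returns a standard file-like
object, so the same loop works everywhere. A minimal sketch, assuming only
the calls above (the bucket/key names are placeholders):

import smart_open

# One loop, three very different backends; only the path string changes.
for path in ('./foo.txt', './foo.txt.gz', 's3://mybucket/mykey.txt'):
    with smart_open.smart_open(path) as fin:
        for line in fin:
            print(line)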
Who?
Open source, MIT License. Maintained by RaRe
Technologies, headed by Radim Rehurek aka
piskvorky.
Why?
- Originally part of gensim, an out-of-core
open-source text processing library (word2vec,
LDA, etc.). smart_open is used there for streaming
large text corpora.
Why?
Boto is not Pythonistic :(
- You have to study 15 pages of the boto book
before using S3.
Solution:
smart_open is Pythonised boto.
What is “Pythonistic”?
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
Write more than 5GB to S3: multipart-ing in Boto
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
# Use a chunk size of 50 MiB
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
# Send the file parts, using FileChunkIO to create a file-like object
# that points to a certain byte range within the original file. We
# set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
>>>     offset = chunk_size * i
>>>     bytes = min(chunk_size, source_size - offset)
>>>     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
>>>         mp.upload_part_from_file(fp, part_num=i + 1)
# Finish the upload
>>> mp.complete_upload()

# Note that if you forget to call either mp.complete_upload() or
# mp.cancel_upload() you will be left with an incomplete upload and
# charged for the storage consumed by the uploaded parts. A call to
# bucket.get_all_multipart_uploads() can help to show lost multipart
# upload parts.
Write more than 5GB to S3: multipart-ing in smart_open

>>> # stream content *into* S3 (write mode, multipart-ing behind the scenes):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
Write more than 5GB to S3: multipart-ing

Boto:
>>> mp = b.initiate_multipart_upload(os.path.basename(source_path))
# Use a chunk size of 50 MiB (feel free to change this)
>>> chunk_size = 52428800
>>> chunk_count = int(math.ceil(source_size / float(chunk_size)))
# Send the file parts, using FileChunkIO to create a file-like object
# that points to a certain byte range within the original file. We
# set bytes to never exceed the original file size.
>>> for i in range(chunk_count):
>>>     offset = chunk_size * i
>>>     bytes = min(chunk_size, source_size - offset)
>>>     with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
>>>         mp.upload_part_from_file(fp, part_num=i + 1)
# Finish the upload
>>> mp.complete_upload()
# Note that if you forget to call either mp.complete_upload() or
# mp.cancel_upload() you will ...

smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
PEP 20, The Zen of Python
From S3 to memory

Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> # Create StringIO in RAM
>>> k.get_contents_as_string()
Traceback (most recent call last):
MemoryError
>>> # Workaround for the memory error: write to local disk first.
>>> # Needs a large local disk.

smart_open:
>>> # can use context managers:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print(line)
>>> # bonus
...     fin.seek(0)  # seek to the beginning
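To make the "no RAM limit" point concrete, here is a hedged sketch of
processing a large S3 object one line at a time (bucket and key names are
placeholders):

import smart_open

# Count lines of an arbitrarily large S3 object while holding only one
# line in memory at a time.
line_count = 0
with smart_open.smart_open('s3://mybucket/huge_corpus.txt') as fin:
    for line in fin:
        line_count += 1
print(line_count)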
From large iterator to S3

Boto:
>>> c = boto.connect_s3()
>>> b = c.get_bucket('mybucket')
>>> k = Key(b)
>>> k.key = 'foobar'
>>> k.set_contents_from_string(list(my_iterator))
Traceback (most recent call last):
MemoryError

smart_open:
>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')
# Streamed input is uploaded in chunks, as soon as `min_part_size`
# bytes are accumulated (see the sketch below).
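A minimal sketch of tuning the chunking, assuming the `min_part_size`
keyword argument of smart_open.smart_open() in write mode (the path, size
and the generator below are illustrative):

import smart_open

# A large stream of text lines that would not fit in RAM all at once.
my_iterator = ('record %d' % i for i in range(10 ** 7))

# Buffer roughly 50 MiB before each multipart part is sent to S3.
with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb',
                           min_part_size=50 * 1024 ** 2) as fout:
    for line in my_iterator:
        fout.write(line + '\n')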
Un/Zipping line by line

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print(line)

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
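Combining the two directions gives out-of-core re-compression: read a large
.gz line by line and write only the lines you keep into a .bz2. A hedged
sketch (the file names and the 'ERROR' filter are placeholders):

import smart_open

# Stream a large gzipped log and keep only error lines, never holding the
# whole file in memory.
with smart_open.smart_open('./big_log.txt.gz') as fin:
    with smart_open.smart_open('./errors.txt.bz2', 'wb') as fout:
        for line in fin:
            if 'ERROR' in line:
                fout.write(line)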
Summary of Why?
Working with large S3 files using Amazon's default Python library, boto, is
a pain.
- Limited by RAM: its key.set_contents_from_string() and
key.get_contents_as_string() methods only work for small files
(loaded into RAM, no streaming).
- There are nasty hidden gotchas when using boto's multipart upload
functionality, and a lot of boilerplate.
smart_open shields you from that.
It builds on boto but offers a cleaner API.
The result is less code for you to write and fewer bugs to make.
- Bonus: a gzip ContextManager even on Python 2.5 and 2.6, where
gzip.GzipFile does not support the `with` statement.
Streaming out-of-core read and write for:
- S3
- HDFS
- WebHDFS (no need to use the requests library yourself!)
- local files
- local compressed files
smart_open is not just for S3!
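Because every backend exposes the same file-like interface, moving data
between stores is just a read loop feeding a write loop. A hedged sketch,
assuming the HDFS and S3 paths below (both are placeholders):

import smart_open

# Copy a large file from HDFS to S3 line by line, out of core.
with smart_open.smart_open('hdfs://user/hadoop/my_file.txt') as fin:
    with smart_open.smart_open('s3://mybucket/my_file.txt', 'wb') as fout:
        for line in fin:
            fout.write(line)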
Thanks!
Lev Konstantinovskiy
github.com/tmylk
@teagermylk
