Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Oct. 1, 2015
Delivered at Strata-Hadoop World in NYC on September 30, 2015
1. Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney, Strata-Hadoop World NYC 2015-09-30
@wesmckinn
© Cloudera, Inc. All rights reserved.
2. Me
• R&D at Cloudera
• Serial creator of structured data tools / user interfaces
• Mathematician, MIT '07
• “Professional SQL programmer” 2007-2010 (@ AQR)
• Created pandas (Python library) in 2008
• Wrote bestseller Python for Data Analysis (2012)
• Founder of DataPad
3. Python is popular…
• Python has become a standard language of data science
• Why is it popular?
• Maximizes productivity for data engineers and data scientists
• Build robust software and do interactive data analysis with 100% Python code
• Easy to learn and makes happy and productive data teams
• Large, diverse open source development community
• Comprehensive libraries: data wrangling, ML, visualization, etc.
• Main use case: data science & engineering Swiss Army knife on small-to-medium size data
4. …but Python does not scale today
• Python ecosystem confined to single-node analysis
• Great for smaller data sets
• Requires sampling or aggregations for larger data
• Distributed tools compromise in various ways
• Extracting samples or aggregations for larger data means:
• “Scales” by losing more fidelity
• Additional ETL overhead to extract samples/aggregations
• Loss of productivity with multiple languages, tools, etc.
• Blocks certain analysis and use cases
5. Some simplistic generalizations
Industry Analytics:
• Heterogeneous data: flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines
Scientific Computing:
• Homogeneous data: multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats
• Fewer physical machines
6. Some simplistic generalizations
Industry Analytics:
• Heterogeneous data: flat tables and JSON
• Spark / MapReduce
• SQL
• DFS-friendly / streaming data formats
• More physical machines
Scientific Computing:
• Homogeneous data: multidimensional arrays
• HPC tools
• Linear algebra
• Scientific data formats (e.g. HDF5)
• Fewer physical machines
Python: heavy investment, generally
Python: light investment, generally
7. Industry Analytics: Python’s existential crisis
8. [Image] Source: Wikipedia
9. Our (Python’s) biggest mistake: approaching Big Data like a scientific computing problem
10. pandas
• Hugely popular Python table / “data frame” library
• Labeled table, array, and time series data structures
• Popular for data preparation, ETL, and in-memory analytics
• Built using Python’s scientific computing stack
• User API / domain specific language
• Bespoke in-memory analytics / relational algebra engine
• IO interfaces (CSV, SQL, etc.)
• Expanded data type system (beyond NumPy)
• Supports flat data only (or semi-structured data that can be flattened)
11. Many SQL engines … and more
12. The “Great Decoupling” for Big Data
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
13. A sample big data architecture
[Diagram components: Application data, Kafka (×4), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]
14. Big data architectures currently dominated by JVM languages, with increasing amounts of C++
Python/R/Julia don’t have much of a “seat at the table”
15. Nested / Complex types support
• Arrays, structs, maps, and unions as first-class value types
• Analyze JSON-like data directly without flattening or normalization
• Most new SQL engines have some level of support
• Impala
• Presto
• Drill
• BigQuery
• Spark SQL
• Hive
• …
16. Ibis in a nutshell
• For Python programmers doing analytics in industry
• Project Blog: http://blog.ibis-project.org
• Joint project with Impala team @ Cloudera
• Apache-licensed, open source http://github.com/cloudera/ibis
• Crafting a compelling Python-on-Hadoop user experience
• Remove SQL coding from user workflows
• Develop high performance Python extension APIs
17. Ibis in a nutshell, cont’d
• Composable Python DSL (“Ibis expressions”) makes hand-coding SQL SELECT statements unnecessary
• Ibis for SQL Programmers: http://docs.ibis-project.org/sql.html
• Development roadmap targets Impala (C++ / LLVM) query engine
• …but SQL compiler toolchain is general purpose
• Currently supports Impala and SQLite, but soon other dialects
• We welcome external contributors for other Analytic SQL engines
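To make the "composable DSL compiles to SQL" idea on slide 17 concrete, here is a minimal sketch of an expression tree that renders a SELECT statement. Everything in it (the `Table`, `Column`, `BinaryOp`, and `Agg` classes, and the `trips` table with its columns) is hypothetical and invented for illustration; it is not the Ibis API and handles no quoting, escaping, or dialect differences.

```python
# Toy illustration only: these classes are hypothetical and are NOT the Ibis API.


class Expr:
    """Base class for a node in a tiny expression tree."""

    def __gt__(self, other):
        return BinaryOp(">", self, wrap(other))

    def __add__(self, other):
        return BinaryOp("+", self, wrap(other))

    def sum(self):
        return Agg("SUM", self)


class Column(Expr):
    def __init__(self, name):
        self.name = name

    def to_sql(self):
        return self.name


class Literal(Expr):
    def __init__(self, value):
        self.value = value

    def to_sql(self):
        # repr() is good enough for numbers; real code must escape strings.
        return repr(self.value)


class BinaryOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def to_sql(self):
        return f"({self.left.to_sql()} {self.op} {self.right.to_sql()})"


class Agg(Expr):
    def __init__(self, func, arg):
        self.func, self.arg = func, arg

    def to_sql(self):
        return f"{self.func}({self.arg.to_sql()})"


def wrap(value):
    """Turn plain Python values into Literal nodes; pass expressions through."""
    return value if isinstance(value, Expr) else Literal(value)


class Table:
    def __init__(self, name, predicates=()):
        self.name, self.predicates = name, tuple(predicates)

    def __getitem__(self, column):
        return Column(column)

    def filter(self, predicate):
        # Returns a new Table, so expressions compose without mutation.
        return Table(self.name, self.predicates + (predicate,))

    def aggregate(self, expr):
        sql = f"SELECT {expr.to_sql()} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(p.to_sql() for p in self.predicates)
        return sql


t = Table("trips")
query = t.filter(t["distance"] > 10).aggregate((t["fare"] + t["tip"]).sum())
print(query)  # SELECT SUM((fare + tip)) FROM trips WHERE (distance > 10)
```

Because each method returns a new expression object, the user builds queries from ordinary Python values and the SQL string is produced only at the end, which is roughly what lets one front end target several SQL dialects.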
19. Benefits of Ibis
• Maximize developer productivity
• Mirrors single-node Python experience
• Solve big data problems without leaving Python
• Leverage Python skills, ecosystem, and tools
• Python as first-class language for Hadoop
• Full-fidelity analysis without extractions
• Python analysis at any scale
• Native hardware speeds for a broad set of use cases
20. Brief interactive demo
21. Ibis/Impala Joint Roadmap
• More natural data modeling
• Complex types support
• Integration with full Python data ecosystem
• Advanced analytics + machine learning
• Enable use of performance computing tools
• User extensibility with native performance
• In-memory columnar format
• Python-to-LLVM IR compilation
• Workflow and usability tools
22. Executing data science languages in the compute layer
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?
23. Enabling interoperability with big data systems
• Distributed / MPP query engines: implemented in a host language
• Typically C/C++ or Java/Scala
• User-defined functions (UDFs) through various means
• Implement in host language
• Implement in user language through some external language protocol (often RPC-based)
• External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)
24. What are UDFs good for?
• Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
• Custom data transformations
• Custom domain logic (date / time / data types)
• Custom data types
• Custom aggregations (incl. machine learning / statistics expressible as reductions)
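As an illustration of a custom aggregation "expressible as a reduction" (slide 24), the sketch below computes mean and sample variance with separate update, merge, and finalize steps, so each partition can be reduced independently and the partial states combined afterwards. The function names and the `MeanVarState` type are assumptions made for this example, not the UDF interface of Hive, Impala, or any other engine; the update and merge formulas are the standard Welford and pairwise-combination ones.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class MeanVarState:
    """Partial aggregation state: count, running mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def update(state, x):
    """Fold one value into a partial state (Welford's online update)."""
    n = state.n + 1
    delta = x - state.mean
    mean = state.mean + delta / n
    m2 = state.m2 + delta * (x - mean)
    return MeanVarState(n, mean, m2)


def merge(a, b):
    """Combine two partial states computed on disjoint partitions."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return MeanVarState(n, mean, m2)


def finalize(state):
    """Turn the final state into (mean, sample variance)."""
    variance = state.m2 / (state.n - 1) if state.n > 1 else float("nan")
    return state.mean, variance


# Three "partitions" stand in for data living on different nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
partials = [reduce(update, part, MeanVarState()) for part in partitions]
mean, variance = finalize(reduce(merge, partials, MeanVarState()))
print(mean, variance)
```

The merge step is what makes the aggregation distributable: partial states can be combined in any grouping, so the engine is free to choose how work is split across nodes.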
25. Why are external UDFs slow?
• Serialization / deserialization overhead
• Scalar vs vectorized computations
• RPC overhead
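A rough way to see the costs listed on slide 25 is to count serialization round trips. In this sketch a `FakeExternalWorker` class (hypothetical, running in-process with no real RPC) stands in for an out-of-process UDF worker, and `pickle` stands in for whatever wire format a real engine would use. The same function is invoked once per row and then once per batch.

```python
import pickle


class FakeExternalWorker:
    """Stand-in for an out-of-process UDF worker (hypothetical; no real RPC)."""

    def __init__(self, func):
        self.func = func
        self.round_trips = 0

    def call(self, payload_bytes):
        self.round_trips += 1
        args = pickle.loads(payload_bytes)      # deserialize the request
        return pickle.dumps(self.func(args))    # serialize the response


def scalar_udf(row):
    """CASE WHEN x > y THEN x ELSE x + y END for a single row."""
    x, y = row
    return x if x > y else x + y


def batch_udf(rows):
    """Same logic, applied to a whole batch in one call."""
    return [scalar_udf(row) for row in rows]


rows = [(i, 50) for i in range(100)]

# Row-at-a-time protocol: one serialize/deserialize round trip per row.
scalar_worker = FakeExternalWorker(scalar_udf)
scalar_out = [pickle.loads(scalar_worker.call(pickle.dumps(r))) for r in rows]

# Batch protocol: one round trip for all rows.
batch_worker = FakeExternalWorker(batch_udf)
batch_out = pickle.loads(batch_worker.call(pickle.dumps(rows)))

print(scalar_worker.round_trips, batch_worker.round_trips)  # 100 1
```

Both protocols return identical results; the difference is that fixed per-call costs (serialization setup, and in a real system the network or IPC hop) are paid 100 times in one case and once in the other.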
26. Example: Vectorization for interpreted languages
SUM(CASE WHEN x > y THEN x ELSE x + y END)
27. Vectorized vs Interpreted perf
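The expression from slide 26 can be evaluated in two styles, sketched below. The row-at-a-time version interprets the whole expression once per row; the column-at-a-time version calls each primitive once per column, so interpreter overhead per operator is amortized over the batch. Plain Python lists are used here only to keep the sketch dependency-free, and the helper names (`greater`, `add`, `select`) are made up for the example. In practice each primitive would be a compiled array kernel, for instance from NumPy.

```python
def row_at_a_time(xs, ys):
    """Interpret SUM(CASE WHEN x > y THEN x ELSE x + y END) one row at a time."""
    total = 0
    for x, y in zip(xs, ys):
        total += x if x > y else x + y
    return total


# Column-at-a-time primitives: each consumes and produces whole columns.
def greater(a, b):
    return [p > q for p, q in zip(a, b)]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def select(mask, if_true, if_false):
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def column_at_a_time(xs, ys):
    """Same expression, evaluated as three whole-column operations plus a sum."""
    mask = greater(xs, ys)                  # x > y
    chosen = select(mask, xs, add(xs, ys))  # CASE ... THEN x ELSE x + y
    return sum(chosen)                      # SUM(...)


xs = [5, 1, 7, 3]
ys = [2, 4, 7, 9]
print(row_at_a_time(xs, ys), column_at_a_time(xs, ys))  # 36 36
```

Note that the vectorized form computes `x + y` for every row, including rows where the result is discarded; that extra work is typically a good trade when each primitive runs as a tight native loop.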
28. How to make them fast?
• Common runtime memory representation for tabular data
• Shared-memory (zero-copy or memcpy-only) external UDF protocol
• Vectorized UDF interface (for interpreted languages)
• Impala is uniquely positioned to play well with Ibis
• Best-in-class performance and scalability
• C++ and LLVM-based (JIT compiler) runtime
• Unified, efficient data interchange amongst Ibis, Impala, and Kudu will enable high performance real time analytics from Python
29. Memory representation
• Many query engines are standardizing on in-memory columnar representation of materialized transient data
• Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
• Apache Drill: https://drill.apache.org/faq/
• Industry-standard serialization format: Apache Parquet
• https://parquet.apache.org/
30. Serialization vs In-memory
• Serialization formats (e.g. Parquet)
• Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
• Do not consider random access or in-memory analytics as a goal
• No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
31. Standardized in-memory columnar (IMC)
• Compact in-memory representation for semistructured data
• Part of Impala’s upcoming dev roadmap
• Some prior IMC-for-SQL work: Apache Drill
• Standardized memory representation means data can be shared without serialization
• Create a canonical C/C++ implementation for use in Python / R / Julia
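The point on slide 31 that a shared memory representation lets data move "without serialization" can be sketched with the standard library alone. Below, each column is one contiguous typed buffer (`array.array`), and a `memoryview` gives a second consumer access to the same bytes with no copy. This is only an analogy under simplifying assumptions (fixed-width types, a single process); it does not represent the format Impala or Drill use, and the sample rows are made up.

```python
from array import array

# Row-oriented input: a list of (id, value) tuples.
rows = [(1, 2.5), (2, 4.0), (3, 8.5)]

# Columnar layout: one contiguous, typed buffer per column.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
values = array("d", (r[1] for r in rows))  # 64-bit floats

# A memoryview exposes the same buffer to another consumer without copying.
view = memoryview(values)
tail = view[1:]  # slicing a memoryview also does not copy

# A write through the owning array is visible through the views,
# which shows that all three names refer to the same memory.
values[2] = 10.0
print(tail[1], view.nbytes)  # 10.0 24
```

A consumer that understands the layout (here, "contiguous 64-bit floats") can read the column directly, whereas a serialization format would require decoding into some new in-memory structure first.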
32. Ibis’s Vision
• Uncompromised Python experience
• 100% Python end-to-end user workflows
• Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
• Interactive at big data scale
• Full-fidelity analysis without extractions
• Scalability for big data
• Native hardware speeds for a broad set of use cases
33. Thank you
Wes McKinney @wesmckinn
Views are my own