High Performance Hadoop with Python - Webinar

Scale Up & Scale Out with Anaconda

Python is the fastest growing Open Data Science language & is used more than 50% of the time to extract value from Big Data in Spark.

However, both PySpark & SparkR involve JVM overhead and Python/Java serialization when interacting with Spark, which negatively impacts the time-to-value from your Big Data. What if there were a way to leverage the entire Python ecosystem without refactoring your Hadoop-based data science investments, and get high performance?

Anaconda, the leading Open Data Science Platform, delivers high performance Python for Hadoop. You get to leverage your existing Python-based data science investments with your existing Hadoop or HPC clusters. Anaconda bypasses the typical Hadoop performance issues, builds on Python's existing high-performance scientific and array-based computing, and now includes Dask, a powerful parallel execution framework, to deliver fast results on any enterprise Hadoop distribution, such as Cloudera & Hortonworks.

On April 13th, Dr. Kristopher Overholt & Dr. Matthew Rocklin of Continuum Analytics present a webinar on High Performance Hadoop with Python.

In this webinar, you'll learn to:
- Analyze NYC taxi data through distributed DataFrames on a cluster on HDFS
- Create interactive distributed visualizations of global temperature data
- Distribute in-memory natural language processing & interactive queries on text data in HDFS
- Wrap and parallelize existing legacy code on custom file formats


  1. High Performance Hadoop with Python
  2. Presenter Bio: Kristopher Overholt received his Ph.D. in Civil Engineering from The University of Texas at Austin. Prior to joining Continuum, he worked at the National Institute of Standards and Technology (NIST), Southwest Research Institute (SwRI), and The University of Texas at Austin. Kristopher has 10+ years of experience in areas including applied research, scientific and parallel computing, system administration, open-source software development, and computational modeling. Kristopher Overholt, Solution Architect, Continuum Analytics
  3. Presenter Bio: Matthew Rocklin received his Ph.D. in computer science from the University of Chicago and is currently employed at Continuum Analytics as a computational scientist. He is an active contributor to many open source projects in the PyData ecosystem and is the lead developer of Dask. Matthew Rocklin, Computational Scientist, Continuum Analytics
  4. Overview • Overview of Continuum and Anaconda • Overview of Dask (Distributed Processing Framework) • Example parallel workflows with Anaconda and Dask • Distributed dataframes on a cluster with CSV data • Distributed natural language processing with text data • Analyzing array-based global temperature data • Parallelizing custom code and workflows • Using Anaconda with Dask • Solutions with Anaconda and Dask
  5. Overview of Continuum and Anaconda
  6. The Platform to Accelerate, Connect & Empower: Continuum Analytics is the company behind Anaconda and offers consulting, training, open-source software, and enterprise software. Anaconda is the leading open data science platform, powered by Python, the fastest growing open data science language.
  7. Deep Domain & Python Knowledge. Founders: Travis Oliphant, creator of NumPy and SciPy; Peter Wang, creator of the Chaco & Bokeh visualization libraries. Engineers: Antoine Pitrou, Python core developer; Jeff Reback, Pandas maintainer and core developer; Carlos Córdoba, Spyder maintainer and core developer; Damián Avila and Chris Colbert, Jupyter core team members; Michael Droettboom, Matplotlib maintainer and core developer.
  8. Trusted by Industry Leaders. Financial Services: risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting. Government: fraud detection, data crawling, web & cyber data analytics, statistical modeling. Healthcare & Life Sciences: genomics data processing, cancer research, natural language processing for health data science. High Tech: customer behavior, recommendations, ad bidding, retargeting, social media analytics. Retail & CPG: engineering simulation, supply chain modeling, scientific analysis. Oil & Gas: pipeline monitoring, noise logging, seismic data processing, geophysics.
  9. Anaconda: the leading open data science platform powered by Python. Quickly engage with your data: • 720+ popular packages • Optimized & compiled • Free for everyone • Extensible via the conda package manager • Sandboxed packages & libraries • Cross-platform: Windows, Linux, Mac • Not just Python: over 230 R packages • The foundation of our enterprise products
  10. Anaconda: Accelerating Adoption of Python for Enterprises • Collaborative notebooks with publication, authentication, & search (Jupyter/IPython) • Python & package management for the Hadoop & Apache Spark stack • Performance with compiled Python for lightning-fast execution (Numba) • Visual apps for interactivity, streaming, & big data (Bokeh) • Secure & robust repository of data science libraries, scripts, & notebooks (conda) • Enterprise data integration with optimized connectors & out-of-core processing (NumPy & Pandas) • Parallel computing that scales up Python analytics on your cluster for interactivity and streaming data (Dask)
  11. Bottom line: 10-100X faster performance. Where batch processing (PySpark & SparkR) and interactive processing (Ibis, Impala) route through YARN and the JVM, Anaconda works with the native Python & R ecosystem (NumPy, Pandas, and 720+ packages) with native read & write to HDFS, for high performance, interactive, and batch processing. • Interact with data in HDFS and Amazon S3 natively from Python • Distributed computations without the JVM & Python/Java serialization • Framework for easy, flexible parallelism using directed acyclic graphs (DAGs) • Interactive, distributed computing with in-memory persistence/caching
  12. Overview of Dask as a Distributed Processing Framework
  13. Overview of Dask. Dask is a Python parallel computing library that is: • Familiar: Implements parallel NumPy and Pandas objects • Fast: Optimized for demanding numerical applications • Flexible: Supports sophisticated and messy algorithms • Scales up: Runs resiliently on clusters of 100s of machines • Scales down: Pragmatic in a single process on a laptop • Interactive: Responsive and fast for interactive data science. Dask complements the rest of Anaconda; it was developed with the NumPy, Pandas, and scikit-learn developers.
  14. Spectrum of Parallelization, from explicit control (fast but hard) to implicit control (restrictive but easy): threads, processes, MPI, ZeroMQ, Dask, Hadoop, Spark, SQL (Hive, Pig, Impala).
  15. Dask: From User Interaction to Execution
  16. Dask Collections: Familiar Expressions and API • Dask array (mimics NumPy): x.T - x.mean(axis=0) • Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean() • Dask bag (a collection of data): b.map(json.loads).foldby(...) • Dask imperative (wraps custom code): def load(filename): ..., def clean(data): ..., def analyze(result): ...
  17. Dask Graphs: Example Machine Learning Pipeline
  18. Dask Graphs: Example Machine Learning Pipeline + Grid Search
  19. Dask Schedulers: Example: Distributed Scheduler. A Client on the user machine (laptop) connects over the network to a Scheduler, which dispatches work to multiple Workers.
  20. Example Parallel Workflows with Anaconda and Dask
  21. Examples: 1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames (demonstrate Pandas at scale; observe a responsive user interface). 2. Distributed language processing with text data using Dask Bags (explore data using a distributed memory cluster; interactively query data using libraries from Anaconda). 3. Analyzing global temperature data using Dask Arrays (visualize complex algorithms; learn about Dask collections and tasks). 4. Handle custom code and workflows using Dask Imperative (deal with messy situations; learn about scheduling).
  22. Example 1: Using Dask DataFrames on a cluster with CSV data • Built from Pandas DataFrames (e.g. one per month: January through May, 2016) • Match the Pandas interface • Access data from HDFS, S3, local disk, etc. • Fast, low latency • Responsive user interface
  23. Example 2: Using Dask Bags on a cluster with text data • Distributed natural language processing with text data stored in HDFS • Handles standard computations • Looks like other parallel frameworks (Spark, Hive, etc.) • Access data from HDFS, S3, local disk, etc. • Handles the common case
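As a small sketch of the Bag workflow, here is a word count over a made-up corpus; on a cluster the bag would instead come from text files (e.g. `db.read_text` over an HDFS glob):

```python
import dask.bag as db

# A tiny hypothetical corpus; on a cluster this might come from
# db.read_text("hdfs:///data/texts/*.txt") instead.
lines = ["the quick brown fox", "jumps over the lazy dog", "the end"]
b = db.from_sequence(lines, npartitions=2)

# Classic word count: split each line, flatten, then tally occurrences.
counts = (b.map(str.split)
           .flatten()
           .frequencies()
           .compute())

print(dict(counts)["the"])  # 3
```

The chain of `map`/`flatten`/`frequencies` reads much like the equivalent Spark RDD or Hive query, which is the point of the "looks like other parallel frameworks" bullet.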
  24. Example 3: Using Dask Arrays with global temperature data • A Dask array is built from many NumPy n-dimensional arrays arranged in a grid • Matches the NumPy interface (a subset) • Solve medium-to-large problems • Complex algorithms
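The chunked-array idea can be sketched with a stand-in for a temperature grid (the shape and constant values are made up; real data would be loaded from e.g. HDF5/NetCDF files):

```python
import numpy as np
import dask.array as da

# A stand-in for a global temperature cube (time x lat x lon),
# split into NumPy chunks of shape (2, 3, 4).
temps = da.from_array(np.ones((4, 6, 8)), chunks=(2, 3, 4))

# Operations match the NumPy interface and run chunk-by-chunk.
time_mean = temps.mean(axis=0)   # mean over the time axis
anomaly = temps - time_mean      # broadcasting, as in NumPy

print(time_mean.shape)                 # (6, 8)
print(float(anomaly.sum().compute()))  # 0.0 for constant data
```

Because each chunk is a plain NumPy array, the scheduler can compute chunks in parallel and spill intermediate results, which is how "medium-to-large problems" fit on modest machines.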
  25. Example 4: Using Dask Delayed to handle custom workflows • Manually handle functions to support messy situations • A lifesaver when collections aren't flexible enough • Combine futures with collections for the best of both worlds • The scheduler provides resilient and elastic execution
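A minimal sketch of the load/clean/analyze pattern with `dask.delayed`; the three functions and the file names are hypothetical stand-ins for custom code over custom file formats:

```python
from dask import delayed

# Hypothetical stand-ins for steps over a custom file format.
def load(filename):
    return list(range(3))          # pretend to parse a file

def clean(data):
    return [x * 2 for x in data]   # pretend to scrub the records

def analyze(results):
    return sum(sum(r) for r in results)

# Wrap ordinary functions; nothing runs until .compute().
results = [delayed(clean)(delayed(load)(f)) for f in ["a.dat", "b.dat"]]
total = delayed(analyze)(results)

print(total.compute())  # each file -> [0, 2, 4], so 6 + 6 = 12
```

The nested `delayed` calls build a task graph by hand, so arbitrary control flow that doesn't fit the array/dataframe/bag collections still gets parallel, resilient execution.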
  26. Precursors to Parallelism. Consider the following approaches first: 1. Use better algorithms 2. Try Numba or C/Cython 3. Store data in efficient formats 4. Subsample your data. If you have to parallelize: 1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk) 2. Then a large workstation (24 cores, 1 TB RAM) 3. Finally, scale out to a cluster
  27. Using Anaconda with Dask
  28. Cluster Architecture Diagram: a client machine connects to the head node, which coordinates the compute nodes.
  29. Using Anaconda and Dask on your Cluster • On a single machine with multiple threads or processes • On a cluster with SSH (dcluster) • With resource management: YARN (knit), SGE, Slurm • On the cloud with Amazon EC2 (dec2) • On a cluster with Anaconda for cluster management • Manage multiple conda environments and packages on bare-metal or cloud-based clusters
  30. Anaconda for Cluster Management • Dynamically manage Python and conda environments across a cluster • Works with enterprise Hadoop distributions and HPC clusters • Integrates with an on-premises Anaconda repository • Cluster management features are available with Anaconda subscriptions
  31. Cluster Deployment & Operations. Before Anaconda for cluster management: on the head node and on each compute node, manually install Python, packages & dependencies, then manually install R, packages & dependencies. After Anaconda for cluster management: easily install conda environments and packages (including Python and R) across cluster nodes. Bottom line • Empower IT with scalable and supported Anaconda deployments • Fast, secure and scalable Python & R package management on tens or thousands of nodes • Backed by an enterprise configuration management system • Scalable Anaconda deployments tested in enterprise Hadoop and HPC environments
  32. Using Dask and Anaconda Enterprise on your Cluster (diagram: analyst machine, admin, Anaconda repository, and a Hadoop or HPC cluster with an edge node and compute nodes running Dask and Anaconda)
  33. Using Dask and Anaconda Enterprise on your Cluster, step 1: the analyst ships packages and environments to the on-premises repository.
  34. Using Dask and Anaconda Enterprise on your Cluster, step 2: the admin deploys conda packages and environments to cluster nodes.
  35. Using Dask and Anaconda Enterprise on your Cluster, step 3: the analyst submits distributed jobs and utilizes Anaconda on the cluster.
  36. Service Architecture Diagram, showing the services running on the head node, secondary head node, edge nodes, and compute nodes: Anaconda Cluster Head (ACH), Anaconda Cluster Compute (ACC), Anaconda Repository (AR), Dask Scheduler (DS), Dask Worker (DW), Jupyter Notebook, the Hadoop manager, ZooKeeper server, Impala daemon, History Server (Spark), Spark gateway, ResourceManager (YARN), Hue, NameNode (HDFS), Secondary NameNode, DataNode, HttpFS, Hive Metastore, gateway, WebHCat Server, HiveServer2, YARN gateway, NodeManager, and other services.
  37. Remote Conda Commands. Install packages: acluster conda install numpy scipy pandas numba. Create environment: acluster conda create -n py34 python=3.4 numpy scipy pandas. List packages: acluster conda list. Conda information: acluster conda info. Push environment: acluster conda push environment.yml
  38. Cluster Management Commands. Create cluster: acluster create dask-cluster -p dask-profile. List active clusters: acluster list. Install plugins: acluster install notebook distributed. SSH to nodes: acluster ssh. Put/get files: acluster put data.hdf5 /home/ubuntu/data.hdf5. Run command: acluster cmd 'apt-get install ...'
  39. Solutions with Anaconda and Dask
  40. Working with Anaconda and Dask • Open source foundational components: Dask, the distributed scheduler, HDFS reading/writing, YARN interoperability, S3 integration, EC2 provisioning • Enterprise products / subscriptions: Anaconda Workgroup and Anaconda Enterprise, with package management on Hadoop and HPC clusters, integration with an on-premises repository, and provisioning and managing Dask workers on a cluster
  41. The Anaconda cluster stack. Application: Jupyter/IPython Notebook. Analytics: pandas, NumPy, SciPy, Numba, NLTK, scikit-learn, scikit-image, and more from Anaconda. Parallel computation: Dask, Spark, Hive / Impala. Data and resource management: HDFS, YARN, SGE, Slurm, or other distributed systems, on a bare-metal or cloud-based cluster.
  42. Recent Work using Dask • Distributed + remote query & computation: interactive querying, exploration, and browser visualization • Interactive big data visualization in the browser (data shading): automatic & flexible visualization of billions of points in real time • Distributed high-performance analytics inside Hadoop: use all available cores/GPUs for distributed & threaded analysis, with high performance from Anaconda, including Dask
  43. Use Cases with Anaconda and Dask • Runs on a single machine or 100s of nodes • Works on cloud-based or bare-metal clusters • Works with enterprise Hadoop distributions and HPC environments • Develop workflows with text processing, statistics, machine learning, image processing, etc. • Works with data in various formats and storage solutions
  44. Solutions with Anaconda and Dask • Architecture consulting and review • Manage Python packages and environments on a cluster • Develop custom package management solutions on existing clusters • Migrate and parallelize existing code with Python and Dask • Architect parallel workflows and data pipelines with Dask • Build proofs of concept and interactive applications with Dask • Custom product/OSS core development • Training on parallel development with Dask
  45. Anaconda Subscriptions
  46. Additional Resources
  47. Test-Drive Anaconda and Dask on your Cluster: 1. Register for an Anaconda Cloud account at Anaconda.org 2. Download Anaconda for cluster management using conda ($ conda install anaconda-client; $ anaconda login; $ conda install anaconda-cluster -c anaconda-cluster) 3. Create a sandbox/demo cluster ($ acluster create cluster-dask -p cluster-dask) 4. Install Dask and the distributed scheduler ($ acluster install distributed)
  48. Contact Information and Additional Details • Contact sales@continuum.io for information about Anaconda subscriptions, consulting, or training, and support@continuum.io for product support • More information about Anaconda subscriptions: continuum.io/anaconda-subscriptions • View Dask documentation and additional examples at dask.pydata.org
  49. Thank you. Kristopher Overholt (Twitter: @koverholt), Matthew Rocklin (Twitter: @mrocklin). Email: sales@continuum.io, Twitter: @ContinuumIO