Data Science for Big Data
with Anaconda Enterprise
Let Anaconda Take Your Organization to the Next Level

Getting Python and R’s most popular data science libraries to work on a computational cluster can be a major challenge. And in a Big Data world, surmounting this challenge is key to leveraging data science within your organization to make smart, data-driven decisions.

In this live webinar from Team Anaconda, we’ll demonstrate how easily the Anaconda Enterprise data science platform integrates with Hadoop or Spark clusters, giving your data scientists access to the libraries they need and empowering you to extract the most value from your Big Data. The webinar is intended for IT managers and experienced Python and R users interested in writing code on Hadoop/Spark clusters.

You'll learn how Anaconda Enterprise:
- Enables runtime distribution for Hadoop and Spark jobs
- Gathers data from disparate sources for analysis
- Connects easily to your Spark clusters and queries data from Hadoop
- Supports distributed computing with Dask
- Deploys your Big Data applications


  1. Data Science for Big Data with Anaconda Enterprise. Let Anaconda Take Your Organization to the Next Level. Daniel Rodriguez, Data Scientist; Gus Cavanaugh, Product Marketing Manager
  2. Data Scientist: Daniel Rodriguez. Daniel Rodriguez is a Data Scientist and Software Developer with over five years’ experience in areas ranging from DevOps to machine learning. He has performed data analysis and data engineering in big data environments across various industries. Daniel holds a degree in Electrical Engineering from Universidad de los Andes Colombia and an MS in IT Management from UT Dallas. He is passionate about open source data technologies and has spoken at PyData and Spark Summit. © 2017 Anaconda, Inc. - Confidential & Proprietary
  3. Product Marketing Manager: Gus Cavanaugh. Gus Cavanaugh is a Product Marketing Manager at Anaconda, where he focuses on translating technical capabilities into user benefits. He has over five years’ experience in analytics and consulting for enterprises. Prior to joining Anaconda, he worked on projects ranging from small-scale data apps and dashboards to distributed Hadoop clusters at companies including IBM and Booz Allen Hamilton. Gus holds an MS in Systems Engineering from George Washington University and a BS in Business Administration from Washington & Lee University. He is a frequent speaker on analytics topics for non-technical audiences.
  4. Agenda • Install Anaconda Distribution on a cluster • Review the data and ETL process • Analyze data with Spark (Python & R) and Impala • One-click deploy a Python or R application with Anaconda Enterprise
  5. Install Anaconda Distribution on a Cluster • Two options: • Build a custom Cloudera CDH parcel or Ambari management pack • Create and ship an on-the-fly runtime distribution (Python & R runtime on Hadoop)
  6. CDH Parcel and Ambari Mgmt Pack Generation: Anaconda Enterprise offers a UI for building custom distributions
  7. CDH Parcel and Ambari Mgmt Pack Generation: Add packages and versions to the distribution
  8. Install Anaconda Parcel on a CDH Cluster: Add the Anaconda parcel to CDH via Cloudera Manager. https://docs.anaconda.com/anaconda/user-guide/tasks/integration/cloudera
  9. Connect Spark to Anaconda Enterprise • Install Livy on the edge node • Start the Livy server. Connect notebooks to Spark via Apache Livy & Sparkmagic
  10. Connect Spark to Anaconda Enterprise • Add the Livy server to the Sparkmagic config in your project • Start doing your analysis using Spark inside the notebooks
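The Sparkmagic config the slide refers to is a JSON file (typically `~/.sparkmagic/config.json`). A minimal sketch, assuming Livy listens on its default port 8998; the hostname here is a placeholder for your edge node:

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-edge-node.example.com:8998",
    "auth": "None"
  }
}
```

With this in place, notebook cells are shipped through Livy and executed on the remote Spark cluster rather than in a local kernel.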
  11. Review the Data • Format: line-delimited JSON • We transferred the data to S3 • Using our Hadoop cluster, we can load the data from S3. 3 billion Reddit comments (2007-2017) • Source: s3://anaconda-public-datasets/reddit/json
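Line-delimited JSON means one complete JSON object per line, which is what lets Hadoop split the dump across workers. A minimal plain-Python illustration, using a made-up two-comment sample shaped like the Reddit fields in the DDL below:

```python
import json

# Hypothetical two-record sample in the line-delimited JSON (NDJSON)
# layout of the Reddit dump: one complete JSON object per line.
sample = "\n".join([
    json.dumps({"author": "alice", "subreddit": "python", "score": 4}),
    json.dumps({"author": "bob", "subreddit": "python", "score": -1}),
])

def parse_ndjson(text):
    """Parse line-delimited JSON: each non-empty line is one record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

comments = parse_ndjson(sample)
print(len(comments))          # 2
print(comments[0]["author"])  # alice
```

Because each record is self-contained on its own line, a file like this can be split at any newline boundary and parsed in parallel.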
  12. Review the Data: ETL • Distributed copy from Hadoop • Download a JSON serializer for Parquet • Transform the data into Parquet using Hive (Parquet is a columnar data store that enables fast reads). A simple ETL process
  13. Review the Data: ETL • Move data: hadoop distcp s3n://{{ AWS_KEY }}:{{ AWS_SECRET }}@anaconda-publi • Get JSON serializer: wget http://s3.amazonaws.com/elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
  14. Review the Data: ETL: hive > ADD JAR jsonserde.jar; hive > CREATE TABLE reddit_json ( archived boolean, author string, author_flair_css_class string, author_flair_text string, body string, controversiality int, created_utc string, distinguished string, downs int, edited boolean, gilded int, id string, link_id string, name string, parent_id string, removal_reason string, retrieved_on timestamp, score int, score_hidden boolean, subreddit string, subreddit_id string, ups int ) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ('paths'='archived,author,author_flair_css_class,author_flair_text,body,controversiality,created_utc,distinguished,downs,edited,gilded,id,link_id,name,parent_id,removal_reason,retrieved_on,score,score_hidden,subreddit,subreddit_id,ups'); hive > LOAD DATA INPATH '/user/centos/RC_*' INTO TABLE reddit_json;
  15. Review the Data: ETL: hive > CREATE TABLE reddit_parquet ( archived boolean, author string, author_flair_css_class string, author_flair_text string, body string, controversiality int, created_utc string, distinguished string, downs int, edited boolean, gilded int, id string, link_id string, name string, parent_id string, removal_reason string, retrieved_on timestamp, score int, score_hidden boolean, subreddit string, subreddit_id string, ups int, created_utc_t timestamp ) PARTITIONED BY (date_str string) STORED AS PARQUET;
  16. Review the Data: ETL: hive > set dfs.block.size=1g; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; set hive.exec.max.dynamic.partitions=1000; set hive.exec.max.dynamic.partitions.pernode=1000; set hive.optimize.sort.dynamic.partition=true; hive > INSERT OVERWRITE TABLE reddit_parquet PARTITION (date_str) SELECT *, cast(cast(created_utc as double) as timestamp) as created_utc_t, date_format(cast(cast(created_utc as double) as timestamp),'yyyy-MM') as date_str FROM reddit_json;
  17. Analyze Data with Python and R • sparklyr is an R API for Spark • PySpark is the Python API for Spark. Using PySpark and sparklyr
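As a sketch of what the PySpark side looks like: the snippet below assumes it runs in a Sparkmagic notebook cell (so a SparkSession is already bound to `spark`) and that the `reddit_parquet` Hive table from the ETL steps exists; column names follow the earlier DDL. This is illustrative, not the webinar's exact demo code.

```python
# Hedged sketch: assumes a live Sparkmagic/Livy session exposing `spark`,
# and the reddit_parquet table created by the Hive ETL above.
comments = spark.table("reddit_parquet")

# Average comment score per subreddit for one month's partition.
(comments
    .where(comments.date_str == "2017-01")
    .groupBy("subreddit")
    .avg("score")
    .orderBy("avg(score)", ascending=False)
    .show(10))
```

In sparklyr the same query takes a similar shape: reference the table with `tbl(sc, "reddit_parquet")` and pipe it through dplyr verbs.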
  18. Build an Application • Impala is great for SQL queries on Hadoop • With Anaconda Enterprise, you aren’t limited to just Spark, Python, and R; you can use whichever tools you are familiar with
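For example, an application backend can query Impala directly from Python. A hypothetical sketch using the impyla client (host is a placeholder; 21050 is Impala's default daemon port), again against the `reddit_parquet` table:

```python
from impala.dbapi import connect  # impyla package

# Placeholder host: point this at an impalad in your cluster.
conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()
cur.execute(
    "SELECT subreddit, COUNT(*) AS n_comments "
    "FROM reddit_parquet GROUP BY subreddit "
    "ORDER BY n_comments DESC LIMIT 10"
)
for subreddit, n_comments in cur.fetchall():
    print(subreddit, n_comments)
cur.close()
conn.close()
```

Because impyla follows the Python DB-API, the same cursor pattern drops into dashboards or web apps without Spark in the loop.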
  19. Deploy Application • Anaconda Enterprise 5 offers one-click deployments in Python or R • Easily deploy notebooks, APIs, dashboards, and web applications
  20. DEMO
