
data science toolkit 101: set up Python, Spark, & Jupyter

Data science toolkit 101. set up Python, Spark, & Jupyter on a Mac laptop

Published in: Technology


  1. IBM Cloud Data Services
     data science toolkit 101: set up Python, Spark, & Jupyter
     Raj Singh, PhD
     Developer Advocate: Geo | Open Data
     rrsingh@us.ibm.com
     http://ibm.biz/rajrsingh
     Twitter: @rajrsingh
  2. Agenda
     • Installation
     • Python
     • Spark
     • Pixiedust
     • Examples
  3. IBM Analytics Data Science Experience (DSX)
  4. What is Spark?
     • In-memory Hadoop: Hadoop was massively scalable but slow
     • "Up to 100x faster" (more like 10x faster if memory is exhausted)
     • What is Hadoop?
       • HDFS: fault-tolerant storage on horizontally scalable commodity hardware
       • MapReduce: a programming style for distributed processing
     • Spark presents data as an object independent of the underlying storage
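     The map-then-reduce style mentioned above can be sketched in plain Python. This is a toy word count for illustration only (no Hadoop or Spark involved); the function names are made up for this sketch:

     ```python
     from collections import defaultdict

     # Toy word count in the MapReduce style: map each record to
     # (key, value) pairs, group by key, then reduce each group.
     def map_phase(lines):
         for line in lines:
             for word in line.split():
                 yield (word, 1)

     def reduce_phase(pairs):
         counts = defaultdict(int)
         for word, n in pairs:   # "shuffle": group values by key
             counts[word] += n   # reduce: sum the counts per key
         return dict(counts)

     lines = ["spark is fast", "hadoop is scalable"]
     print(reduce_phase(map_phase(lines)))
     # {'spark': 1, 'is': 2, 'fast': 1, 'hadoop': 1, 'scalable': 1}
     ```

     Hadoop and Spark run the same two phases, but distribute the map and reduce work across many machines and handle failures for you.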
  5. Spark abstracted storage
     • Scala
     • PySpark = Spark + Python
     • Drivers
       • File storage
       • Cloudant
       • dashDB
       • Cassandra
       • …
  6. Python installation with miniconda
     1. Download from https://www.continuum.io/downloads (choose version 2.7)
     2. Install Miniconda2 into this location: /Users/<username>/miniconda2
     3. bash$ conda install pandas jupyter matplotlib
     4. Verify the install:
        bash$ which python
        /Users/<username>/miniconda2/bin/python
     Reference: https://dzone.com/refcardz/apache-spark
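     Besides `which python`, you can confirm from inside Python that the Miniconda interpreter (not the system one) is active. A minimal sketch, assuming the install location above; on the setup these slides describe, the path would contain `miniconda2` and the major version would be 2:

     ```python
     import sys

     # Full path of the interpreter actually running this code;
     # it should live under the Miniconda install directory.
     print(sys.executable)

     # Major version: 2 for the Miniconda2 build chosen above.
     print(sys.version_info.major)
     ```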
  7. Spark installation
     • Download from http://spark.apache.org/downloads.html
       • Spark release: 1.6.2
       • Package type: Pre-built for Hadoop 2.6
     • bash$ mkdir dev
     • bash$ cd dev
     • bash$ tar xzf ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz
     • bash$ ln -s spark-1.6.2-bin-hadoop2.6 spark
     • bash$ mkdir ~/dev/notebooks
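     The `ln -s` step above means a later Spark upgrade only needs the `spark` link repointed at the new unpacked release. A sketch of the resulting layout, simulated in a temporary directory (no actual Spark download):

     ```python
     import os
     import tempfile

     # Simulate ~/dev: an unpacked release directory, a notebooks
     # directory, and a version-independent "spark" symlink.
     dev = tempfile.mkdtemp()
     os.makedirs(os.path.join(dev, "spark-1.6.2-bin-hadoop2.6"))
     os.makedirs(os.path.join(dev, "notebooks"))
     os.symlink("spark-1.6.2-bin-hadoop2.6", os.path.join(dev, "spark"))

     # The link resolves to the current release.
     print(os.readlink(os.path.join(dev, "spark")))
     # spark-1.6.2-bin-hadoop2.6
     ```

     Configuration that references `~/dev/spark` (like the kernel spec on the next slide) then survives upgrades unchanged.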
  8. PySpark configuration
     • Create directory ~/.ipython/kernels/pyspark1.6/
     • Create file kernel.json in it:
       {
         "display_name": "pySpark (Spark 1.6.2) Python 2",
         "language": "python",
         "argv": [
           "/Users/sparktest/miniconda2/bin/python",
           "-m", "ipykernel",
           "-f", "{connection_file}"
         ],
         "env": {
           "SPARK_HOME": "/Users/sparktest/dev/spark",
           "PYTHONPATH": "/Users/sparktest/dev/spark/python/:/Users/sparktest/dev/spark/python/lib/py4j-0.9-src.zip",
           "PYTHONSTARTUP": "/Users/sparktest/dev/spark/python/pyspark/shell.py",
           "PYSPARK_SUBMIT_ARGS": "--master local[10] pyspark-shell",
           "SPARK_DRIVER_MEMORY": "10G",
           "SPARK_LOCAL_IP": "127.0.0.1"
         }
       }
     • bash$ cd ~/dev/spark/conf
     • bash$ cp spark-defaults.conf.template spark-defaults.conf
     • Add to the end of spark-defaults.conf:
       spark.driver.extraClassPath=<HOME DIRECTORY>/data/libs/*
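     A common failure mode with a hand-written kernel.json is invalid JSON (a trailing comma, smart quotes), which makes the kernel silently fail to appear in Jupyter. A small sanity-check sketch using only the standard library; the abbreviated spec string here is illustrative, not the full file:

     ```python
     import json

     # Abbreviated kernel spec for illustration; in practice, read
     # the real file: open("kernel.json").read()
     KERNEL_JSON = """
     {
       "display_name": "pySpark (Spark 1.6.2) Python 2",
       "language": "python",
       "argv": ["/Users/sparktest/miniconda2/bin/python",
                "-m", "ipykernel", "-f", "{connection_file}"]
     }
     """

     spec = json.loads(KERNEL_JSON)  # raises ValueError if malformed
     for field in ("display_name", "language", "argv"):
         assert field in spec, "kernel.json is missing %r" % field
     print(spec["display_name"])
     # pySpark (Spark 1.6.2) Python 2
     ```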
  9. PySpark test
     • bash$ cd ~/dev
     • bash$ jupyter notebook
     • At the upper right of the Jupyter screen, click New and choose "pySpark (Spark 1.6.2) Python 2" (or whatever display_name you specified in your kernel.json file)
     • In the notebook's first cell, enter sc.version and click the >| button (or press CTRL + Enter) to run it
  10. Pixiedust installation
      • bash$ cd ~/dev
      • bash$ git clone https://github.com/ibm-cds-labs/pixiedust.git
      • bash$ pip install --user --upgrade --no-deps -e /Users/sparktest/dev/pixiedust
      • bash$ pip install maven-artifact
      • bash$ pip install mpld3
  11. Examples
      • Pixiedust: https://github.com/ibm-cds-labs/pixiedust
      • Demographic analyses: http://ibm-cds-labs.github.io/open-data/samples/
        or https://github.com/ibm-cds-labs/open-data/tree/master/samples
  12. Thanks
      Raj Singh
      Developer Advocate: Geo | Open Data
      rrsingh@us.ibm.com
      http://ibm.biz/rajrsingh
      Twitter: @rajrsingh
      LinkedIn: rajrsingh
