@MargrietGr
Margriet Groenendijk
Developer Advocate for IBM Cloud Data Services
SW Cloud meetup
Bristol
24 November 2016
Data Science in the Cloud
About me
• Developer Advocate at IBM Cloud Data Services, UK
•Data science
•Python, Spark, R, Cloudant, dashDB
• Research Fellow at University of Exeter, UK
•Worked with very large observational datasets and
the output of global scale climate models
• PhD at Vrije Universiteit Amsterdam, the Netherlands
•Explored large observational datasets of carbon
uptake by forests
1781
http://visual.ly/exports-and-imports-scotland
1821
https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png
1960s
http://www.computerhistory.org/collections/catalog/102630767
1960s
http://www.climatecentral.org/news/first-climate-model-video-19007
Data Science is a Team Effort
• Data Engineers
• Data Scientists
• Business Analysts
• App Developers
Data
Toolbox
http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png
Data Science Workflow
• Discover Data
• Use Data
• Publish Data
• Socialize Data
Data Science Workflow
• Define Question
• Find Data
• Explore Data
• Clean Data
• Visualize and Summarize Data
• Create Predictive Models
• Present Results
Collect Data
APIs
Open Data
Maps
Web Scraping
Time Series
Store Data
Object Store - binary files
Relational database
Document store - json
Bluemix
https://console.ng.bluemix.net/
Explore Data
• Explore Data
• Clean Data
• Store Data
Spark on a Cluster
The Spark Stack
from Karau et al.: Learning Spark
RDDs : Resilient Distributed Datasets
• Data does not have to fit on a single machine
• Data is separated into partitions
• Creation of RDDs
•Load an external dataset
•Distribute a collection of objects
• Transformations construct a new RDD from a previous one (lazy!)
• Actions compute a result based on an RDD
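The split between lazy transformations and actions can be illustrated without a cluster. This is a plain-Python sketch, not the Spark API: the hypothetical MiniRDD class only records transformations and runs them, partition by partition, when an action is called. In PySpark the corresponding calls would be sc.parallelize(...) or sc.textFile(...) to create an RDD, map/filter as transformations, and collect/count as actions.

```python
# Plain-Python illustration of lazy transformations and eager actions.
# MiniRDD is a hypothetical teaching class, not part of Spark.

class MiniRDD:
    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # list of lists, one per partition
        self.steps = steps            # recorded transformations, not yet run

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        """Distribute a collection of objects over partitions."""
        items = list(items)
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def map(self, func):
        # Transformation: returns a new MiniRDD and computes nothing yet
        return MiniRDD(self.partitions, self.steps + (("map", func),))

    def filter(self, pred):
        # Transformation: also only recorded
        return MiniRDD(self.partitions, self.steps + (("filter", pred),))

    def _run(self, partition):
        # Apply the recorded steps, in order, to one partition
        for kind, func in self.steps:
            if kind == "map":
                partition = [func(x) for x in partition]
            else:
                partition = [x for x in partition if func(x)]
        return partition

    def collect(self):
        # Action: runs the recorded steps on every partition
        result = []
        for part in self.partitions:
            result.extend(self._run(part))
        return result

    def count(self):
        # Action: number of elements after all transformations
        return len(self.collect())


calls = []  # records every time square() really runs

def square(x):
    calls.append(x)
    return x * x

numbers = MiniRDD.parallelize(range(6), num_partitions=2)
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
work_before_action = len(calls)          # nothing has been computed so far
result = sorted(even_squares.collect())  # the action triggers the work
```

Because the steps are stored on the object and replayed per partition, no single step needs the whole dataset at once, which is the idea behind the first two bullets.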
Run Spark locally in a Python notebook
https://www.continuum.io/downloads
http://spark.apache.org/downloads.html
Create a new kernel to use in a Jupyter notebook
Jupyter Notebooks!
• Server-client application to edit and run
notebook documents via a web browser
• Cells with:
•Code
•Figures and tables
•Rich text elements
• Different kernels: Python, R, Scala,
Spark
In the Cloud:
http://datascience.ibm.com/
Weather Data
Define Question
What will the weather be next weekend?
https://unsplash.com/search/autumn?photo=LSF8WGtQmn8
https://unsplash.com/search/rain?photo=19tQv51x4-A
Find Data
https://console.ng.bluemix.net/
Explore Data
Python packages
• requests and json
•API credentials and latitude/longitude of Bristol
•json data returned
• pandas, numpy and datetime
•convert json to pandas DataFrame (table with multiple indices)
•add time as index
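A minimal sketch of the JSON-to-DataFrame step, assuming a response shaped like the hard-coded string below. The field names (forecasts, time, temp, rain_chance) and all values are made up for illustration; in the notebook the payload would come from an HTTP call made with requests, using the API credentials and the latitude/longitude of Bristol.

```python
import json

import pandas as pd

# Hard-coded stand-in for the API response.
# Field names and values are assumptions, not the real service schema.
raw = """
{"forecasts": [
  {"time": "2016-11-26T09:00:00", "temp": 6, "rain_chance": 20},
  {"time": "2016-11-26T12:00:00", "temp": 9, "rain_chance": 10},
  {"time": "2016-11-26T15:00:00", "temp": 8, "rain_chance": 40}
]}
"""

data = json.loads(raw)

# One row per forecast entry
df = pd.DataFrame(data["forecasts"])

# Parse the timestamps and use them as the index
df["time"] = pd.to_datetime(df["time"])
df = df.set_index("time")

# With a time index, lookups by time become straightforward
warmest = df["temp"].idxmax()  # timestamp of the highest temperature
```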
Weather forecast for Bristol
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Visualize Data
Python packages
• pandas - rolling mean
• matplotlib
• Basemap
Demo
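The rolling mean mentioned above can be sketched as follows. The temperatures are made-up values, and the window of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up daily temperatures, for illustration only
index = pd.date_range("2016-11-21", periods=6, freq="D")
temps = pd.Series([4.0, 8.0, 6.0, 10.0, 5.0, 9.0], index=index)

# Rolling mean over 3 consecutive values; the first two entries are NaN
# because a full window is not available yet
smooth = temps.rolling(window=3).mean()
```

In a notebook, temps.plot() followed by smooth.plot() draws both series with matplotlib, which makes the smoothing visible.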
Weather map
https://developer.ibm.com/clouddataservices/2016/10/06/your-own-weather-forecast-in-a-python-notebook/
Python packages
• matplotlib
• Basemap
• itertools
• urllib
Weather, Twitter and Sentiment
• Where to find the data?
• Where to store the data?
• Where to analyse the data?
• Quick tools to explore
Insights for Twitter
Add sentiment - example
• Watson Tone Analyser: analyze how you are coming across to others
•Emotion
•Language style
•Social propensities
Workflow
Weather Company Data
crontab -e
0 23 * * * /path/to/file/do_something.sh
python do_something.py
Tweets
Weather
Sentiment
Watson Tone Analyser
Insights for Twitter
Cloudant NoSQL
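The crontab entry above (minute 0, hour 23, every day) runs the shell script once a day at 23:00, which in turn runs the Python script. The following is a rough sketch of what such a script could look like, not the script used in the talk. Every function is a hypothetical placeholder: the fetchers return fixed made-up values instead of calling the services, and the "database" is a plain list standing in for a document store.

```python
import json
from datetime import datetime, timezone


def fetch_weather():
    # Placeholder: a real script would call the weather service here
    return {"temp": 8, "description": "rain"}


def fetch_tweets():
    # Placeholder for the Twitter data source
    return [{"text": "Lovely walk in the rain"}, {"text": "So cold today"}]


def score_sentiment(text):
    # Placeholder for the tone/sentiment service; always returns 0.0
    return 0.0


def build_document(now=None):
    """Combine one run's weather, tweets and sentiment into a JSON-ready dict."""
    now = now or datetime.now(timezone.utc)
    tweets = fetch_tweets()
    for tweet in tweets:
        tweet["sentiment"] = score_sentiment(tweet["text"])
    return {
        "timestamp": now.isoformat(),
        "weather": fetch_weather(),
        "tweets": tweets,
    }


def store(document, database):
    # Placeholder for a document store write: serialise to JSON and append
    database.append(json.dumps(document))


# One run of the daily job
db = []
store(build_document(), db)
```

Keeping each run as one self-describing JSON document suits a document store, since weather, tweets and sentiment for the same timestamp stay together.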
PixieDust
https://github.com/ibm-cds-labs/pixiedust
Simpler Workflow
PixieDust: an Open Source Library that simplifies and
improves Jupyter Python Notebooks
• PackageManager
• Visualizations
• Cloud Integration
• Scala Bridge
• Extensibility
• Embedded Apps
https://developer.ibm.com/clouddataservices/2016/10/11/pixiedust-magic-for-python-notebook/
@DTAIEB55
Install Spark packages or plain jars in your Notebook Python
kernel without the need to modify a configuration file
Uses the GraphFrame Python APIs
Install GraphFrames Spark Package
One simple API: display()
Call the Options dialog
Panning/Zooming options
Performance statistics
Easily export your data to csv, json, html, etc. locally on your laptop
or into a cloud-based service like Cloudant or Object Storage
Scala Bridge
Define a Python variable
Use the Python var in Scala
Define a Scala variable
Use the Scala var in Python
Easily extend PixieDust to create your own visualizations
using HTML/CSS/JavaScript
Customized Visualization for GraphFrame Graphs
Encapsulate your analytics into compelling User
Interfaces better suited for Line of Business Users
IBM Watson Data Platform
• Data Science Experience
• Watson Data Platform
• Machine Learning
• Sign up for beta: http://datascience.ibm.com/features#machinelearning
https://developer.ibm.com/clouddataservices/author/mgroenen/
Thanks!
Slides will be here:
http://www.slideshare.net/MargrietGroenendijk
