Python for Data Science

The amount of data available to us is growing rapidly, but what does it take to draw useful conclusions from it?

Outline
1. Different tactics to gather your data
2. Cleansing, scrubbing, and correcting your data
3. Running analysis on your data
4. Bringing your data to life with visualizations
5. Publishing your data for the rest of us as linked open data

  1. Python for Data Science
     PyCon Finland 17.10.2011
     Harri Hämäläinen
     harri.hamalainen aalto fi
     #sgwwx
  2. What is Data Science
     • Data science ~ computer science + mathematics/statistics + visualization
     • Most web companies are actually doing some kind of Data Science:
       - Facebook, Amazon, Google, LinkedIn, Last.fm, ...
       - Social network analysis, recommendations, community building, decision making, analyzing emerging trends, ...
  3. Scope of the talk
     • What is in this presentation
       - Pythonized tools for retrieving and dealing with data
       - Methods, libraries, ethics
     • What is not included
       - Dealing with very large data
       - Math or detailed algorithms behind library calls
  4. Outline
     • Harvesting
     • Cleaning
     • Analyzing
     • Visualizing
     • Publishing
  5. Data harvesting
  6. Authoritative borders for data sources
     1. Data from your system
        - e.g. user access log files, purchase history, view counts
        - e.g. sensor readings, manual gathering
     2. Data from the services of others
        - e.g. view counts, tweets, housing prices, sport results, financial data
  7. Data sources
     • Locally available data
     • Data dumps from the Web
     • Data through Web APIs
     • Structured data in Web documents
  8. API
     • Available for many web applications, accessible with general Python libraries
       - urllib, soaplib, suds, ...
     • Some APIs are available even as application-specific Python libraries
       - python-twitter, python-linkedin, pyfacebook, ...
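     Many of these APIs speak JSON over plain HTTP, so the general libraries go a long way. A minimal sketch (the endpoint URL is hypothetical; substitute the API you are actually using):

        import urllib, simplejson

        # fetch a JSON document from a (hypothetical) web API
        fd = urllib.urlopen("http://api.example.org/v1/items?format=json")
        items = simplejson.loads(fd.read())
        print len(items)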
  9. Data sources
     • Locally available data
     • Data dumps from the Web
     • Data through Web APIs
     • Structured data in Web documents
  10. Crawling
      • Crawlers (spiders, web robots) are used to autonomously navigate the Web documents

        import urllib

        # queue holds urls, managed by some other component
        def crawler(queue):
            url = queue.get()
            fd = urllib.urlopen(url)
            content = fd.read()
            links = parse_links(content)  # parse_links: link extraction, defined elsewhere
            # process content
            for link in links:
                queue.put(link)
  11. Web Scraping
      • Extract information from structured documents in the Web
      • Multiple libraries for parsing XML documents
      • But in general, web documents are rarely valid XML
      • Some candidates who will stand by you when the data contains “dragons”
        - BeautifulSoup
        - lxml
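      As a taste of the latter (not on the slide): lxml parses even invalid HTML into a usable tree. A small sketch with a hypothetical URL:

        import lxml.html

        # lxml.html repairs broken markup while building the tree
        doc = lxml.html.parse("http://www.example.org/").getroot()
        hrefs = doc.xpath("//a/@href")  # all link targets on the page
        print hrefs[:5]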
  12. urllib & BeautifulSoup

      >>> import urllib, BeautifulSoup
      >>> fd = urllib.urlopen("http://mobile.example.org/foo.html")
      >>> soup = BeautifulSoup.BeautifulSoup(fd.read())
      >>> print soup.prettify()
      ...
      <table>
      <tr><td>23</td></tr>
      <tr><td>24</td></tr>
      <tr><td>25</td></tr>
      <tr><td>22</td></tr>
      <tr><td>23</td></tr>
      </table>
      ...
      >>> values = [elem.text for elem in soup.find("table").findChildren("td")]
      >>> values
      [u'23', u'24', u'25', u'22', u'23']
  13. Scrapy
      • A framework for crawling web sites and extracting structured data
      • Features
        - extracts elements from XML/HTML (with XPath)
        - makes it easy to define crawlers with support for more specific needs (e.g. HTTP compression headers, robots.txt, crawling depth)
        - real-time debugging
      • http://scrapy.org
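      A minimal spider sketch against the Scrapy 0.x API of the time (the site URL and XPath are placeholders, not from the talk):

        from scrapy.spider import BaseSpider
        from scrapy.selector import HtmlXPathSelector

        class ExampleSpider(BaseSpider):
            name = "example"
            start_urls = ["http://www.example.org/"]

            def parse(self, response):
                # extract all top-level headings from the fetched page
                hxs = HtmlXPathSelector(response)
                titles = hxs.select("//h1/text()").extract()
                self.log("headings: %r" % titles)

      Run it with "scrapy crawl example" inside a Scrapy project.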
  14. Tips and ethics
      • Use the mobile version of the sites if available
      • No cookies
      • Respect robots.txt
      • Identify yourself
      • Use compression (RFC 2616)
      • If possible, download bulk data first, process it later
      • Prefer dumps over APIs, APIs over scraping
      • Be polite and request permission to gather the data
      • Worth checking: https://scraperwiki.com/
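      Several of these tips cost only a few lines with urllib2; a sketch (the bot name and contact address are made up):

        import urllib2, gzip, StringIO

        req = urllib2.Request("http://www.example.org/data.html")
        # identify yourself so the site owner can reach you
        req.add_header("User-Agent", "research-bot/0.1 (contact: foo@example.org)")
        # ask for a compressed response (RFC 2616)
        req.add_header("Accept-Encoding", "gzip")
        resp = urllib2.urlopen(req)
        data = resp.read()
        if resp.info().get("Content-Encoding") == "gzip":
            data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()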
  15. Data cleansing (preprocessing)
  16. Data cleansing
      • Harvested data may come with lots of noise
      • ... or interesting anomalies?
      • Detection
        - Scatter plots
        - Statistical functions describing the distribution
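      A sketch of both detection approaches (the input file is hypothetical, and the 3-sigma threshold is a common rule of thumb, not from the talk):

        import numpy, pylab
        from scipy.stats import describe

        values = numpy.loadtxt("readings.txt")  # hypothetical input data
        print describe(values)  # size, min/max, mean, variance, ...
        # values more than 3 standard deviations from the mean are outlier candidates
        suspects = values[abs(values - values.mean()) > 3 * values.std()]
        pylab.scatter(range(len(values)), values)
        pylab.show()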
  17. Data preprocessing
      • Goal: provide a structured representation for analysis
        - Network (graph)
        - Values with dimension
  18. Network representation
      • A vast number of datasets describe a network
        - Social relations
        - Plain old web pages with links
        - Anything where some entities in the data are related to each other
  19. Analyzing the Data
  20. NumPy & SciPy
      • NumPy
        - Offers an efficient multidimensional array object, ndarray
        - Basic linear algebra operations and data types
      • SciPy
        - Builds on top of NumPy
        - Modules for statistics, optimization, signal processing, ...
        - Add-ons (called SciKits) for machine learning, data mining, ...
        - Requires GNU Fortran
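      For a flavour of ndarray and the linear algebra routines (a minimal sketch, not from the slides):

        >>> import numpy
        >>> a = numpy.array([[1., 2.], [3., 4.]])
        >>> a.mean(), a.std()
        (2.5, 1.1180339887498949)
        >>> numpy.linalg.solve(a, [1., 0.])  # solve a*x = [1, 0]
        array([-2. ,  1.5])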
  21. [Figure: scatter plot of weights vs. heights; curve fitting with the NumPy polyfit & polyval functions]

      >>> pylab.scatter(weights, heights)
      >>> xlabel("weight (kg)")
      >>> ylabel("height (cm)")
      >>> pearsonr(weights, heights)
      (0.5028, 0.0)
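      The curve fitting mentioned in the figure caption comes down to two calls; a sketch assuming the same weights and heights arrays:

        >>> coeffs = numpy.polyfit(weights, heights, 1)  # least-squares fit, degree 1
        >>> fitted = numpy.polyval(coeffs, weights)      # evaluate the fitted polynomial
        >>> pylab.plot(weights, fitted, 'r-')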
  22. NumPy + SciPy + Matplotlib + IPython
      • Provides a Matlab-ish environment
      • ipython provides an extended interactive interpreter (tab completion, magic functions for object querying, debugging, ...)

        ipython --pylab
        In [1]: x = linspace(1, 10, 42)
        In [2]: y = randn(42)
        In [3]: plot(x, y, 'r--.', markersize=15)
  23. Analyzing networks
      • Basically two bigger alternatives
        - NetworkX
        - igraph
  24. NetworkX
      • Python package for dealing with complex networks:
        - Importers / exporters
        - Graph generators
        - Serialization
        - Algorithms
        - Visualization
  25. NetworkX

      >>> import networkx
      >>> g = networkx.Graph()
      >>> data = {'Turku': [('Helsinki', 165), ('Tampere', 157)],
      ...         'Helsinki': [('Kotka', 133), ('Kajaani', 557)],
      ...         'Tampere': [('Jyväskylä', 149)],
      ...         'Jyväskylä': [('Kajaani', 307)]}
      >>> for k, v in data.items():
      ...     for dest, dist in v:
      ...         g.add_edge(k, dest, weight=dist)
      >>> networkx.astar_path_length(g, 'Turku', 'Kajaani')
      613
      >>> networkx.astar_path(g, 'Kajaani', 'Turku')
      ['Kajaani', 'Jyväskylä', 'Tampere', 'Turku']
  26. Visualizing Data
  27. NetworkX

      (the same graph-building code as on slide 25, followed by)
      >>> networkx.draw(g)
  28. PyGraphviz
      • Python interface for the Graphviz layout engine
        - Graphviz is a collection of graph layout programs
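      A minimal sketch of the pygraphviz interface (the graph contents are made up):

        >>> import pygraphviz
        >>> g = pygraphviz.AGraph()
        >>> g.add_edge("Turku", "Helsinki")
        >>> g.add_edge("Helsinki", "Kotka")
        >>> g.layout(prog="dot")  # let the dot program position the nodes
        >>> g.draw("cities.png")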
  29.
      >>> pylab.scatter(weights, heights)
      >>> xlabel("weight (kg)")
      >>> ylabel("height (cm)")
      >>> pearsonr(weights, heights)
      (0.5028, 0.0)
      >>> title("Scatter plot for weight/height statistics\nn=25000, pearsoncor=0.5028")
      >>> savefig("figure.png")
  30. 3D visualizations with the mplot3d toolkit
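      A small surface-plot sketch with mplot3d (the example function is made up):

        import numpy, pylab
        from mpl_toolkits.mplot3d import Axes3D  # registers the "3d" projection

        fig = pylab.figure()
        ax = fig.add_subplot(111, projection="3d")
        x, y = numpy.meshgrid(numpy.linspace(-2, 2, 40), numpy.linspace(-2, 2, 40))
        ax.plot_surface(x, y, numpy.exp(-x**2 - y**2))
        pylab.show()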
  31. Data publishing
  32. Open Data
      • Certain data should be open and therefore available to everyone to use in one way or another
      • Some open their data to others hoping it will be beneficial for them, or just because there is no need to hide it
      • Examples of open dataset types
        - Government data
        - Life sciences data
        - Culture data
        - Commerce data
        - Social media data
        - Cross-domain data
  33. The Zen of Open Data

      Open is better than closed.
      Transparent is better than opaque.
      Simple is better than complex.
      Accessible is better than inaccessible.
      Sharing is better than hoarding.
      Linked is more useful than isolated.
      Fine grained is preferable to aggregated.
      Optimize for machine readability — they can translate for humans.
      ...
      “Flawed, but out there” is a million times better than “perfect, but unattainable”.
      ...
      Chris McDowall & co.
  34. Sharing the Data
      • Some convenient formats
        - JSON (import simplejson)
        - XML (import xml)
        - RDF (import rdflib, SPARQLWrapper)
        - GraphML (import networkx)
        - CSV (import csv)
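      Writing the same records in two of these formats takes only a handful of lines; a sketch reusing the city distances from the NetworkX example:

        import csv, simplejson

        rows = [("Helsinki", 165), ("Tampere", 157)]  # distances from Turku, in km

        # JSON: one dictionary, one call
        simplejson.dump(dict(rows), open("distances.json", "w"))

        # CSV: header row followed by the records
        writer = csv.writer(open("distances.csv", "wb"))
        writer.writerow(("city", "distance_km"))
        writer.writerows(rows)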
  35. Resource Description Framework (RDF)
      • Collection of W3C standards for modeling complex relations and exchanging information
      • Allows data from multiple sources to combine nicely
      • RDF describes data with triples
        - each triple has the form subject - predicate - object,
          e.g. PyconFi2011 - is organized in - Turku
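      The slide's example triple, expressed with rdflib (the namespace URI is hypothetical):

        import rdflib

        EX = rdflib.Namespace("http://example.org/ns#")  # hypothetical namespace
        g = rdflib.Graph()
        # PyconFi2011 - isOrganizedIn - Turku
        g.add((EX["PyconFi2011"], EX["isOrganizedIn"], EX["Turku"]))
        print g.serialize(format="n3")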
  36. RDF Data

      @prefix poi:  <http://schema.onki.fi/poi#> .
      @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
      ...
      <http://purl.org/finnonto/id/rky/p437>
          a poi:AreaOfInterest ;
          poi:description """Turun rautatieasema on maailmansotien välisen ajan merkittävimpiä asemarakennushankkeita Suomessa..."""@fi ;
          poi:hasPolygon "60.454833421,22.253543828 60.453846032,22.254787430 60.453815665,22.254725349..." ;
          poi:history """Turkuun suunniteltiin rautatietä 1860-luvulta lähtien, mutta ensimmäinen rautatieyhteys Turkuun..."""@fi ;
          poi:municipality kunnat:k853 ;
          poi:poiType poio:tuotantorakennus , poio:asuinrakennus , poio:puisto ;
          poi:webPage "http://www.rky.fi/read/asp/r_kohde_det.aspx?KOHDE_ID=1865" ;
          skos:prefLabel "Turun rautatieympäristöt"@fi .
      ...
  37. SPARQL example: query for libraries in Helsinki located within a 1 kilometer radius from the city centre

      from SPARQLWrapper import SPARQLWrapper, JSON

      QUERY = """
      Prefix lgd:<http://linkedgeodata.org/>
      Prefix lgdo:<http://linkedgeodata.org/ontology/>
      Prefix rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      Select distinct ?label ?g From <http://linkedgeodata.org> {
        ?s rdf:type <http://linkedgeodata.org/ontology/Library> .
        ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
        ?s geo:geometry ?g .
        Filter(bif:st_intersects(?g, bif:st_point(24.9375, 60.170833), 1)) .
      }"""

      sparql = SPARQLWrapper("http://linkedgeodata.org/sparql")
      sparql.setQuery(QUERY)
      sparql.setReturnFormat(JSON)
      results = sparql.query().convert()["results"]["bindings"]
      for result in results:
          print result["label"]["value"].encode("utf-8"), result["g"]["value"]

      Output:
      German library POINT(24.9495 60.1657)
      Metsätalo POINT(24.9497 60.1729)
      Opiskelijakirjasto POINT(24.9489 60.1715)
      Topelia POINT(24.9493 60.1713)
      Eduskunnan kirjasto POINT(24.9316 60.1725)
      Rikhardinkadun kirjasto POINT(24.9467 60.1662)
      Helsinki 10 POINT(24.9386 60.1713)
  38. RDF Data sources
      • CKAN
      • DBpedia
      • LinkedGeoData
      • DBTune
      • http://semantic.hri.fi/
  39. harri.hamalainen aalto fi