Python business intelligence (PyData 2012 talk)

  • 10,413 views

What is the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? Different approaches for business data and scientific data.

Video: https://vimeo.com/53063944

Transcript

  • 1. Python for Business Intelligence ■ Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
  • 2. python business intelligence
  • 3. Results: Q/A and articles with Java solution references (not listed here)
  • 4. Why?
  • 5. Overview ■ Traditional Data Warehouse ■ Python and Data ■ Is Python Capable? ■ Conclusion
  • 6. Business Intelligence
  • 7. people ■ technology ■ processes
  • 8. Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities
  • 9. Traditional Data Warehouse
  • 10. ■ Extracting data from the original sources ■ Quality assuring and cleaning data ■ Conforming the labels and measures in the data to achieve consistency across the original sources ■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards. Source: Ralph Kimball – The Data Warehouse ETL Toolkit
  • 11. [architecture diagram] Source Systems (structured documents, databases, APIs) → Temporary Staging Area → Staging Area → Operational Data Store → Datamarts (staging / relational / dimensional; L0, L1, L2)
  • 12. real time = daily
  • 13. Multi-dimensional Modeling
  • 14. aggregation browsing slicing and dicing
  • 15. business / analyst’s point of view, regardless of physical schema implementation
  • 16. Facts ■ measurable fact ■ fact data cell ■ most detailed information
  • 17. dimensions: location, type, time
  • 18. Dimension ■ provide context for facts ■ used to filter queries or reports ■ control scope of aggregation of facts
  • 19. Pentaho
  • 20. Python and Data: community perception* (*as of Oct 2012)
  • 21. Scientific & Financial
  • 22. Python
  • 23. Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities
  • 24. Scientific Data: n-dimensional array of numbers
            T1[s]   T2[s]   T3[s]   T4[s]
        P1  112,68  941,67  171,01  660,48
        P2   96,15  306,51  725,88  877,82
        P3  313,39  189,31   41,81  428,68
        P4  760,62  983,48  371,21  281,19
        P5  838,56   39,27  389,42  231,12
  • 25. Assumptions ■ data is mostly numbers ■ data is neatly organized... ■ … in one multi-dimensional array
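The table on slide 24, viewed the way scientific tools see it: one plain 2-D array aggregated along either axis. A small sketch (values copied from the slide, comma decimal marks written as dots):

```python
# The slide's 5x4 timing table as a plain 2-D array of numbers.
data = [
    [112.68, 941.67, 171.01, 660.48],
    [96.15,  306.51, 725.88, 877.82],
    [313.39, 189.31,  41.81, 428.68],
    [760.62, 983.48, 371.21, 281.19],
    [838.56,  39.27, 389.42, 231.12],
]

# Aggregate along either axis, as array-oriented tools do on n-d arrays.
row_totals = [sum(row) for row in data]                   # one total per P1..P5
col_means = [sum(col) / len(col) for col in zip(*data)]   # one mean per T1..T4

print(round(row_totals[0], 2))  # 1885.84
print(round(col_means[0], 2))   # 424.28
```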
  • 26. Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities
  • 27. Business Data
  • 28. multiple snapshots of one source ■ multiple representations of same data ■ categories are changing
  • 29.
  • 30. Is Python Capable? (very basic examples)
  • 31. Data Pipes with SQLAlchemy ■ Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities
  • 32. ■ connection: create_engine ■ schema reflection: MetaData, Table ■ expressions: select(), insert()
  • 33. src_engine = create_engine("sqlite:///data.sqlite")
        src_metadata = MetaData(bind=src_engine)
        src_table = Table("data", src_metadata, autoload=True)
        target_engine = create_engine("postgres://localhost/sandbox")
        target_metadata = MetaData(bind=target_engine)
        target_table = Table("data", target_metadata)
  • 34. clone schema:
        for column in src_table.columns:
            target_table.append_column(column.copy())
        target_table.create()
        copy data:
        insert = target_table.insert()
        for row in src_table.select().execute():
            insert.execute(row)
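The reflect-clone-copy pattern on slides 33–34 can be sketched end to end with only the standard library's sqlite3 module (a PRAGMA query stands in for SQLAlchemy's reflected Table objects); the table name and data here are made up for the example:

```python
import sqlite3

def copy_table(src, dst, table):
    """Clone a table's schema from src into dst, then copy all rows."""
    # Reflect the schema: PRAGMA table_info returns (cid, name, type, ...).
    cols = src.execute("PRAGMA table_info(%s)" % table).fetchall()
    col_defs = ", ".join("%s %s" % (c[1], c[2]) for c in cols)
    dst.execute("CREATE TABLE %s (%s)" % (table, col_defs))
    # Copy the data row by row, as in the slide's insert loop.
    placeholders = ", ".join("?" for _ in cols)
    for row in src.execute("SELECT * FROM %s" % table):
        dst.execute("INSERT INTO %s VALUES (%s)" % (table, placeholders), row)
    dst.commit()

# Example source database with a couple of rows.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE data (id INTEGER, amount REAL)")
src.executemany("INSERT INTO data VALUES (?, ?)", [(1, 10.5), (2, 20.0)])

dst = sqlite3.connect(":memory:")
copy_table(src, dst, "data")
print(dst.execute("SELECT COUNT(*) FROM data").fetchone()[0])  # 2
```

The SQLAlchemy version buys you the same pattern across database engines; the sketch above only shows the mechanics.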
  • 35. magic used: metadata reflection
  • 36. text file (CSV) to table:
        reader = csv.reader(file_stream)
        columns = reader.next()
        for column in columns:
            table.append_column(Column(column, String))
        table.create()
        for row in reader:
            insert.execute(row)
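The same CSV-to-table idea (header row becomes the columns, every column loaded as text) works with just the csv and sqlite3 standard-library modules; the table name and sample data are invented for the example:

```python
import csv
import io
import sqlite3

def load_csv(conn, name, stream):
    """Create a table named after `name` from a CSV stream: the first
    row supplies column names, all columns are created as TEXT."""
    reader = csv.reader(stream)
    columns = next(reader)  # header row -> column names
    col_defs = ", ".join('"%s" TEXT' % c for c in columns)
    conn.execute('CREATE TABLE "%s" (%s)' % (name, col_defs))
    placeholders = ", ".join("?" for _ in columns)
    # The remaining rows stream straight into the insert.
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (name, placeholders), reader)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_csv(conn, "contracts", io.StringIO("region,amount\nEU,100\nUS,200\n"))
print(conn.execute("SELECT * FROM contracts").fetchall())
```

Note the slide's `reader.next()` is Python 2 (current for 2012); the sketch uses `next(reader)`.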
  • 37. Simple T from ETL ■ Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities
  • 38. transformation = [
            ("fiscal_year", {"function": int, "field": "fiscal_year"}),
            ("region_code", {"mapping": region_map, "field": "region"}),
            ("borrower_country", None),
            ("project_name", None),
            ("procurement_type", None),
            ("major_sector_code", {"mapping": sector_code_map, "field": "major_sector"}),
            ("major_sector", None),
            ("supplier", None),
            ("contract_amount", {"function": currency_to_number, "field": "total_contract_amount"}),
        ]
        target fields ← source transformations
  • 39. Transformation:
        for row in source:
            result = transform(row, transformation)
            table.insert(result).execute()
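A minimal `transform()` matching the spec on slide 38 could look like the sketch below. The "function", "mapping", and "field" keys mirror the slide, but the helper itself is an assumption, not the talk's actual implementation:

```python
def transform(row, transformation):
    """Build a target row (dict) from a source row using (target, spec)
    pairs: None copies the field; a spec may rename via "field",
    translate via "mapping", and convert via "function"."""
    result = {}
    for target, spec in transformation:
        if spec is None:                      # pass through unchanged
            result[target] = row[target]
            continue
        value = row[spec.get("field", target)]
        if "mapping" in spec:                 # lookup-table translation
            value = spec["mapping"].get(value, value)
        if "function" in spec:                # type/format conversion
            value = spec["function"](value)
        result[target] = value
    return result

# Hypothetical mapping and row, just to exercise each spec kind.
region_map = {"EU": "Europe"}
transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
]
row = {"fiscal_year": "2012", "region": "EU", "supplier": "ACME"}
print(transform(row, transformation))
# {'fiscal_year': 2012, 'region_code': 'Europe', 'supplier': 'ACME'}
```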
  • 40. OLAP with Cubes ■ Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities
  • 41. Model:
        {
            "name": "My Model",
            "description": ...,
            "cubes": [...],
            "dimensions": [...]
        }
        cubes: measures ■ dimensions: levels, attributes, hierarchy
  • 42. logical vs. physical ❄
  • 43. Application flow (cubes ■ Aggregation Browser ■ backend):
        1. load_model("model.json")
        2. create_workspace("sql", model, url="sqlite:///data.sqlite")
        3. model.cube("sales")
        4. workspace.browser(cube)
  • 44. drill-down: browser.aggregate(cell, drilldown=["sector"])
  • 45. for row in result.table_rows("sector"):
            row.record["amount_sum"]
            row.label
            row.key
  • 46. whole cube:
            cell = Cell(cube)
            browser.aggregate(cell)                      # Total
            browser.aggregate(cell, drilldown=["date"])  # 2006 2007 2008 2009 2010
        slice:
            cut = PointCut("date", [2010])
            cell = cell.slice(cut)
            browser.aggregate(cell, drilldown=["date"])  # Jan Feb Mar Apr ...
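The cell/cut/drill-down idea on slides 44–46 can be illustrated over a plain list of fact dicts. This is a concept sketch only, not the Cubes API; in Cubes the browser pushes this work down to the SQL backend, and the dimensions and facts here are invented:

```python
from collections import defaultdict

# Facts as plain dicts; `amount` is the measure, `year` and `sector` dimensions.
facts = [
    {"year": 2009, "sector": "water", "amount": 80},
    {"year": 2010, "sector": "water", "amount": 100},
    {"year": 2010, "sector": "roads", "amount": 250},
]

def aggregate(facts, cell=None, drilldown=None):
    """Sum the measure over facts matching the cell's point cuts,
    optionally grouped by one drilldown dimension."""
    cell = cell or {}
    selected = [f for f in facts
                if all(f[dim] == value for dim, value in cell.items())]
    if drilldown is None:
        return sum(f["amount"] for f in selected)
    totals = defaultdict(int)
    for f in selected:
        totals[f[drilldown]] += f["amount"]
    return dict(totals)

print(aggregate(facts))                            # whole cube: 430
print(aggregate(facts, drilldown="year"))          # {2009: 80, 2010: 350}
print(aggregate(facts, {"year": 2010}, "sector"))  # slice on 2010, drill by sector
```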
  • 47. How can Python be Useful?
  • 48. just the language ■ saves maintenance resources ■ shortens development time ■ saves you from going insane
  • 49. [architecture diagram, as slide 11, annotated “faster”] Source Systems (structured documents, databases, APIs) → Temporary Staging Area → Staging Area → Operational Data Store → Datamarts (staging / relational / dimensional; L0, L1, L2)
  • 50. faster, advanced ■ Sources ■ Extraction, Transformation, Loading ■ Data Analysis and Presentation ■ Data Governance Technologies and Utilities ■ understandable, maintainable
  • 51. Conclusion
  • 52. BI is about… people ■ technology ■ processes
  • 53. don’t forget metadata
  • 54. Future: who is going to fix your COBOL/Java tool if you have only Python guys around?
  • 55. Python is capable, let’s start
  • 56. Thank You ■ Twitter: @Stiivi ■ DataBrewery blog: blog.databrewery.org ■ Github: github.com/Stiivi