Python business intelligence (PyData 2012 talk)

  1. Python for Business Intelligence Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
  2. python business intelligence
  3. Results: Q/A and articles with Java solution references (not listed here)
  4. Why?
  5. Overview ■ Traditional Data Warehouse ■ Python and Data ■ Is Python Capable? ■ Conclusion
  6. Business Intelligence
  7. people technology processes
  8. Data Sources ■ Extraction, Transformation, Loading ■ Analysis and Presentation ■ Data Governance ■ Technologies and Utilities
  9. Traditional Data Warehouse
  10. ■ Extracting data from the original sources ■ Quality assuring and cleaning data ■ Conforming the labels and measures in the data to achieve consistency across the original sources ■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards. Source: Ralph Kimball – The Data Warehouse ETL Toolkit
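The four steps quoted on slide 10 can be sketched in plain Python. This is only an illustration under stated assumptions: the CSV content, the `REGION_LABELS` map and all function names are hypothetical and do not come from the talk.

```python
import csv
import io

# Hypothetical raw extract: two sources label the same region differently.
RAW_CSV = """region,amount
EU,100
Europe,250
,75
"""

# Hypothetical conforming map: source label -> shared label.
REGION_LABELS = {"EU": "Europe", "Europe": "Europe"}


def extract(text):
    """Extract: read rows from the original source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Quality-assure: drop rows with a missing region."""
    return [row for row in rows if row["region"]]


def conform(rows):
    """Conform labels and measures so that all sources agree."""
    return [
        {"region": REGION_LABELS[row["region"]], "amount": int(row["amount"])}
        for row in rows
    ]


def deliver(rows):
    """Deliver: here simply a list of dicts a reporting tool could consume."""
    return rows


delivered = deliver(conform(clean(extract(RAW_CSV))))
```

Each step is a separate function so that a step can be replaced (for example, delivering to a database table instead of a list) without touching the others.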
  11. Source Systems: structured documents, databases, APIs ■ Staging Area: temporary staging area, staging (L0) ■ Operational Data Store: relational (L1) ■ Datamarts: dimensional (L2)
  12. real time = daily
  13. Multi-dimensional Modeling
  14. aggregation browsing slicing and dicing
  15. business / analyst’s point of view regardless of physical schema implementation
  16. Facts ■ measurable ■ fact ■ fact data cell ■ most detailed information
  17. dimensions: location, type, time
  18. Dimension ■ provide context for facts ■ used to filter queries or reports ■ control scope of aggregation of facts
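The three roles of a dimension listed on slide 18 (context, filtering, scope of aggregation) can be shown with a few hypothetical fact rows. The data and the `aggregate` helper below are assumptions made for illustration; they are not part of the talk or of any library.

```python
from collections import defaultdict

# Hypothetical facts: each row is one measurable fact (amount)
# together with its dimension values (location, type, year).
facts = [
    {"location": "Bratislava", "type": "goods", "year": 2011, "amount": 100},
    {"location": "Bratislava", "type": "services", "year": 2012, "amount": 40},
    {"location": "New York", "type": "goods", "year": 2012, "amount": 70},
]


def aggregate(rows, by, where=None):
    """Sum `amount` grouped by dimension `by`, keeping rows that match `where`."""
    where = where or {}
    totals = defaultdict(int)
    for row in rows:
        # Dimensions used as a filter.
        if all(row[dim] == value for dim, value in where.items()):
            # Dimension used to control the scope of aggregation.
            totals[row[by]] += row["amount"]
    return dict(totals)


by_location = aggregate(facts, by="location")
goods_by_year = aggregate(facts, by="year", where={"type": "goods"})
```

Here `by` controls the scope of aggregation and `where` filters the facts, while the dimension values themselves give each amount its context.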
  19. Pentaho
  20. Python and Data ■ community perception* (*as of Oct 2012)
  21. Scientific & Financial
  22. Python
  23. Data Sources ■ Extraction, Transformation, Loading ■ Analysis and Presentation ■ Data Governance ■ Technologies and Utilities
  24. Scientific Data ■ n-dimensional array of numbers

               T1[s]    T2[s]    T3[s]    T4[s]
      P1      112,68   941,67   171,01   660,48
      P2       96,15   306,51   725,88   877,82
      P3      313,39   189,31    41,81   428,68
      P4      760,62   983,48   371,21   281,19
      P5      838,56    39,27   389,42   231,12
  25. Assumptions ■ data is mostly numbers ■ data is neatly organized... ■ … in one multi-dimensional array
  26. Data Sources ■ Extraction, Transformation, Loading ■ Analysis and Presentation ■ Data Governance ■ Technologies and Utilities
  27. Business Data
  28. ■ multiple snapshots of one source ■ multiple representations of same data ■ categories are changing
  29. Is Python Capable? very basic examples
  30. Data Pipes with SQLAlchemy (diagram: Data Sources; Extraction, Transformation, Loading; Analysis and Presentation; Data Governance; Technologies and Utilities)
  31. ■ connection: create_engine ■ schema reflection: MetaData, Table ■ expressions: select(), insert()
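  32. connect and reflect:

      from sqlalchemy import create_engine, MetaData, Table

      # SQLAlchemy API as of 2012: bound MetaData and autoload=True
      # were removed in SQLAlchemy 2.0.

      # Source: reflect the existing table definition from the database.
      src_engine = create_engine("sqlite:///data.sqlite")
      src_metadata = MetaData(bind=src_engine)
      src_table = Table('data', src_metadata, autoload=True)

      # Target: an empty table definition, columns are added later.
      target_engine = create_engine("postgresql://localhost/sandbox")
      target_metadata = MetaData(bind=target_engine)
      target_table = Table('data', target_metadata)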
  33. clone schema:

      for column in src_table.columns:
          target_table.append_column(column.copy())
      target_table.create()

      copy data:

      insert = target_table.insert()
      for row in src_table.select().execute():
          insert.execute(row)
  34. magic used: metadata reflection
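Metadata reflection means asking the database for a table's definition instead of hard-coding it. Below is a minimal sketch of the same clone-and-copy idea using only the standard-library `sqlite3` module rather than SQLAlchemy, so that it runs without extra packages. This is a swapped-in technique, not the code from the talk; the table contents and variable names are made up.

```python
import sqlite3

# Two in-memory databases stand in for the source and the target.
src = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

src.execute("CREATE TABLE data (id INTEGER, name TEXT, amount REAL)")
src.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1, "a", 1.5), (2, "b", 2.5)],
)

# Reflect: ask the database which columns the table has.
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in src.execute("PRAGMA table_info(data)")]

# Clone schema: build CREATE TABLE from the reflected columns.
column_sql = ", ".join('"{}" {}'.format(name, ctype) for name, ctype in columns)
target.execute("CREATE TABLE data ({})".format(column_sql))

# Copy data row by row.
placeholders = ", ".join("?" for _ in columns)
insert_sql = "INSERT INTO data VALUES ({})".format(placeholders)
for row in src.execute("SELECT * FROM data"):
    target.execute(insert_sql, row)
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
```

The copy loop never names a column: everything it needs comes from the reflected metadata, which is the point of the slide.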
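  35. text file (CSV) to table:

      import csv
      from sqlalchemy import Column, String

      reader = csv.reader(file_stream)
      columns = next(reader)  # first row holds the column names

      # Every CSV column becomes a string column.
      for column in columns:
          table.append_column(Column(column, String))
      table.create()

      insert = table.insert()
      for row in reader:
          insert.execute(row)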
  36. Simple T from ETL (diagram: Data Sources; Extraction, Transformation, Loading; Analysis and Presentation; Data Governance; Technologies and Utilities)
  37. target fields and source transformations:

      # Each entry: (target field, source transformation or None to copy as is)
      transformation = [
          ('fiscal_year', {"function": int, "field": "fiscal_year"}),
          ('region_code', {"mapping": region_map, "field": "region"}),
          ('borrower_country', None),
          ('project_name', None),
          ('procurement_type', None),
          ('major_sector_code', {"mapping": sector_code_map,
                                 "field": "major_sector"}),
          ('major_sector', None),
          ('supplier', None),
          ('contract_amount', {"function": currency_to_number,
                               "field": 'total_contract_amount'}),
      ]
  38. Transformation

      for row in source:
          result = transform(row, transformation)
          table.insert(result).execute()
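The slides do not show `transform` itself. One possible reading of the transformation list is sketched below: `None` copies the field of the same name, `"function"` applies a callable, and `"mapping"` looks the value up in a dictionary. The function body, the `currency_to_number` helper, the `region_map` values and the sample row are all assumptions, not the author's implementation.

```python
def currency_to_number(text):
    """Hypothetical helper: '$1,200.50' -> 1200.5."""
    return float(text.replace("$", "").replace(",", ""))


# Assumed mapping values, for illustration only.
region_map = {"Africa": "AFR", "Europe and Central Asia": "ECA"}

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("project_name", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]


def transform(row, transformation):
    """Build a target record (dict) from a source row (dict)."""
    result = {}
    for target_field, rule in transformation:
        if rule is None:
            # No rule: copy the source field of the same name.
            result[target_field] = row[target_field]
            continue
        value = row[rule["field"]]
        if "function" in rule:
            value = rule["function"](value)
        elif "mapping" in rule:
            value = rule["mapping"][value]
        result[target_field] = value
    return result


source_row = {
    "fiscal_year": "2010",
    "region": "Africa",
    "project_name": "Road repair",
    "total_contract_amount": "$1,200.50",
}
record = transform(source_row, transformation)
```

Keeping the rules in a plain list means the transformation can be reviewed, stored or generated as data, separately from the code that executes it.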
  39. OLAP with Cubes (diagram: Data Sources; Extraction, Transformation, Loading; Analysis and Presentation; Data Governance; Technologies and Utilities)
  40. Model

      {
          "name": "My Model",
          "description": ....,
          "cubes": [...],
          "dimensions": [...]
      }

      ■ cubes: measures ■ dimensions: levels, attributes, hierarchy
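To make the model structure on slide 40 concrete, here is a small sketch that loads a model-like JSON document and resolves which levels a cube can be browsed by. Only the top-level keys `name`, `cubes` and `dimensions` come from the slide; the keys inside each cube and dimension (`measures`, `levels`, the per-cube `dimensions` list) and the `cube_levels` helper are assumptions and may not match the actual Cubes model format.

```python
import json

# Hypothetical model document; nested key names are assumptions.
MODEL_JSON = """
{
    "name": "My Model",
    "cubes": [
        {"name": "sales", "measures": ["amount"],
         "dimensions": ["date", "sector"]}
    ],
    "dimensions": [
        {"name": "date", "levels": ["year", "month"]},
        {"name": "sector", "levels": ["sector"]}
    ]
}
"""

model = json.loads(MODEL_JSON)

# Index cubes and dimensions by name for easy lookup.
dimensions = {dim["name"]: dim for dim in model["dimensions"]}
cubes = {cube["name"]: cube for cube in model["cubes"]}


def cube_levels(cube_name):
    """Return {dimension name: levels} for every dimension the cube uses."""
    cube = cubes[cube_name]
    return {name: dimensions[name]["levels"] for name in cube["dimensions"]}


sales_levels = cube_levels("sales")
```

The model describes the analyst's logical view; nothing in it says which tables or columns hold the data.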
  41. logical physical ❄
  42. Application → cubes → Aggregation Browser → backend

      1. model = load_model("model.json")
      2. workspace = create_workspace("sql", model, url="sqlite:///data.sqlite")
      3. cube = model.cube("sales")
      4. browser = workspace.browser(cube)
  43. drill-down:

      result = browser.aggregate(cell, drilldown=["sector"])
  44. for row in result.table_rows("sector"):
          row.record["amount_sum"]
          row.label
          row.key
  45. whole cube:

      cell = Cell(cube)
      browser.aggregate(cell)                        # Total
      browser.aggregate(cell, drilldown=["date"])    # 2006 2007 2008 2009 2010

      cut:

      cut = PointCut("date", [2010])
      cell = cell.slice(cut)
      browser.aggregate(cell, drilldown=["date"])    # Jan Feb Mar Apr March April May ...
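What the aggregate, drill-down and slice calls on slide 45 compute can be imitated in plain Python. The sketch below does not use Cubes and is not its implementation: the fact rows, the `(year, month)` path representation and the `aggregate` signature are all assumptions chosen to mirror the slide.

```python
from collections import defaultdict

# Hypothetical fact rows; "date" is a (year, month) path through its hierarchy.
facts = [
    {"date": (2009, 12), "sector": "energy", "amount": 10},
    {"date": (2010, 1), "sector": "energy", "amount": 20},
    {"date": (2010, 1), "sector": "health", "amount": 5},
    {"date": (2010, 2), "sector": "health", "amount": 7},
]


def _path(value):
    """Treat flat values such as "energy" as one-level paths."""
    return value if isinstance(value, tuple) else (value,)


def _in_cell(row, cuts):
    """True if the row lies inside every cut, given as {dimension: path prefix}."""
    return all(
        _path(row[dim])[:len(path)] == tuple(path)
        for dim, path in cuts.items()
    )


def aggregate(rows, cuts=None, drilldown=None):
    """Sum amounts in the cell described by `cuts`.

    With `drilldown`, split the sum by that dimension, one level below its cut.
    """
    cuts = cuts or {}
    selected = [row for row in rows if _in_cell(row, cuts)]
    if drilldown is None:
        return sum(row["amount"] for row in selected)
    depth = len(cuts.get(drilldown, ())) + 1
    totals = defaultdict(int)
    for row in selected:
        totals[_path(row[drilldown])[:depth]] += row["amount"]
    return dict(totals)


total = aggregate(facts)                                   # whole cube
by_year = aggregate(facts, drilldown="date")               # drill down by year
by_month_2010 = aggregate(facts, cuts={"date": (2010,)},   # slice at 2010,
                          drilldown="date")                # then by month
```

Slicing at 2010 and drilling down by the same dimension moves one level deeper in the hierarchy, from years to the months of 2010.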
  46. How can Python be Useful
  47. just the Language ■ saves maintenance resources ■ shortens development time ■ saves you from going insane
  48. faster ■ Source Systems: structured documents, databases, APIs ■ Staging Area: temporary staging area, staging (L0) ■ Operational Data Store: relational (L1) ■ Datamarts: dimensional (L2)
  49. faster ■ advanced ■ understandable, maintainable (diagram: Data Sources; Extraction, Transformation, Loading; Analysis and Presentation; Data Governance; Technologies and Utilities)
  50. Conclusion
  51. BI is about… people technology processes
  52. don’t forget metadata
  53. Future who is going to fix your COBOL Java tool if you have only Python guys around?
  54. Python is capable, let’s start
  55. Thank You ■ Twitter: @Stiivi ■ DataBrewery blog: blog.databrewery.org ■ Github: github.com/Stiivi