Python business intelligence (PyData 2012 talk)

What is the state of business intelligence tools in Python in 2012? How Python is used for data processing and analysis? Different approaches for business data and scientific data.

Video: https://vimeo.com/53063944

  1. Python for Business Intelligence Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
  2. python business intelligence
  3. Results: Q/A and articles with Java solution references (not listed here)
  4. Why?
  5. Overview ■ Traditional Data Warehouse ■ Python and Data ■ Is Python Capable? ■ Conclusion
  6. Business Intelligence
  7. people ■ technology ■ processes
  8. [Diagram: BI stack with Sources; Extraction, Transformation, Loading; Data Analysis; Presentation; Data Governance; Technologies and Utilities]
  9. Traditional Data Warehouse
  10. ■ Extracting data from the original sources ■ Quality-assuring and cleaning data ■ Conforming the labels and measures in the data to achieve consistency across the original sources ■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards (Source: Ralph Kimball, The Data Warehouse ETL Toolkit)
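
Kimball's four steps map naturally onto plain Python. A minimal sketch (file names, field names, and helpers below are illustrative, not from the talk):

      import csv

      REGION_LABELS = {"EMEA": "Europe, Middle East & Africa", "NA": "North America"}

      def extract(path):
          # extract rows from an original source
          with open(path) as f:
              for row in csv.DictReader(f):
                  yield row

      def clean(rows):
          # quality assurance: drop rows with a missing amount
          for row in rows:
              if row.get("amount"):
                  yield row

      def conform(rows):
          # conform labels to one consistent vocabulary across sources
          for row in rows:
              row["region"] = REGION_LABELS.get(row["region"], row["region"])
              yield row

      def deliver(rows, path):
          # deliver a physical format usable by query and reporting tools
          with open(path, "w") as f:
              writer = csv.DictWriter(f, fieldnames=["region", "amount"],
                                      extrasaction="ignore")
              writer.writeheader()
              writer.writerows(rows)

      deliver(conform(clean(extract("source.csv"))), "warehouse.csv")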
  11. [Diagram: DW architecture. Source Systems (structured documents, databases, APIs) feed a Temporary Staging Area, then staging (L0), a relational Operational Data Store (L1), and dimensional Datamarts (L2)]
  12. real time = daily
  13. Multi-dimensional Modeling
  14. aggregation ■ browsing ■ slicing and dicing
  15. business / analyst’s point of view, regardless of physical schema implementation
  16. Facts ■ measurable fact ■ fact data cell ■ most detailed information
  17. dimensions: location, type, time
  18. Dimension ■ provides context for facts ■ used to filter queries or reports ■ controls scope of aggregation of facts
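
A toy illustration of those three roles (the data is invented for the example):

      facts = [
          {"location": "NY", "type": "retail", "amount": 120},
          {"location": "NY", "type": "online", "amount": 80},
          {"location": "SF", "type": "retail", "amount": 200},
      ]

      # filter a query or report by a dimension value
      ny_facts = [f for f in facts if f["location"] == "NY"]

      # the dimension controls the scope of aggregation (group by location)
      totals = {}
      for f in facts:
          totals[f["location"]] = totals.get(f["location"], 0) + f["amount"]

      print(totals)  # {'NY': 200, 'SF': 200}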
  19. Pentaho
  20. Python and Data: community perception* (*as of Oct 2012)
  21. Scientific & Financial
  22. Python
  23. [Diagram: BI stack as before]
  24. Scientific Data: an n-dimensional array of numbers

          T1[s]   T2[s]   T3[s]   T4[s]
      P1  112.68  941.67  171.01  660.48
      P2   96.15  306.51  725.88  877.82
      P3  313.39  189.31   41.81  428.68
      P4  760.62  983.48  371.21  281.19
      P5  838.56   39.27  389.42  231.12

  25. Assumptions ■ data is mostly numbers ■ data is neatly organized... ■ ... in one multi-dimensional array
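
The table on slide 24 really is one dense array, which is exactly what the scientific stack assumes. In NumPy:

      import numpy as np

      # processes P1..P5 x trials T1..T4, in seconds
      timings = np.array([
          [112.68, 941.67, 171.01, 660.48],
          [ 96.15, 306.51, 725.88, 877.82],
          [313.39, 189.31,  41.81, 428.68],
          [760.62, 983.48, 371.21, 281.19],
          [838.56,  39.27, 389.42, 231.12],
      ])

      print(timings.mean(axis=1))  # average time per process
      print(timings.sum(axis=0))   # total time per trial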
  26. [Diagram: BI stack as before]
  27. Business Data
  28. ■ multiple snapshots of one source ■ multiple representations of same data ■ categories are changing
  29. (image)
  30. Is Python Capable? (very basic examples)
  31. Data Pipes with SQLAlchemy [Diagram: BI stack as before]
  32. ■ connection: create_engine ■ schema reflection: MetaData, Table ■ expressions: select(), insert()
  33. from sqlalchemy import create_engine, MetaData, Table

      src_engine = create_engine("sqlite:///data.sqlite")
      src_metadata = MetaData(bind=src_engine)
      src_table = Table('data', src_metadata, autoload=True)

      target_engine = create_engine("postgres://localhost/sandbox")
      target_metadata = MetaData(bind=target_engine)
      target_table = Table('data', target_metadata)
  34. clone schema:

      for column in src_table.columns:
          target_table.append_column(column.copy())
      target_table.create()

      copy data:

      insert = target_table.insert()
      for row in src_table.select().execute():
          insert.execute(row)
  35. magic used: metadata reflection
  36. text file (CSV) to table:

      import csv
      from sqlalchemy import Column, String

      reader = csv.reader(file_stream)
      columns = reader.next()  # header row (Python 2)
      for column in columns:
          table.append_column(Column(column, String))
      table.create()

      insert = table.insert()
      for row in reader:
          insert.execute(row)
  37. Simple T from ETL [Diagram: BI stack as before]
  38. target fields and their source transformations:

      transformation = [
          ('fiscal_year',       {"function": int, "field": "fiscal_year"}),
          ('region_code',       {"mapping": region_map, "field": "region"}),
          ('borrower_country',  None),
          ('project_name',      None),
          ('procurement_type',  None),
          ('major_sector_code', {"mapping": sector_code_map, "field": "major_sector"}),
          ('major_sector',      None),
          ('supplier',          None),
          ('contract_amount',   {"function": currency_to_number, "field": "total_contract_amount"}),
      ]
  39. Transformation:

      for row in source:
          result = transform(row, transformation)
          table.insert(result).execute()
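
The transform() helper itself is not shown on the slides; a minimal sketch consistent with the structure on slide 38 could be:

      def transform(row, transformation):
          # build a target row from (target_field, rule) pairs:
          # None copies the same-named source field, "function" applies a
          # callable, "mapping" looks the source value up in a dictionary
          result = {}
          for target_field, rule in transformation:
              if rule is None:
                  result[target_field] = row[target_field]
              elif "function" in rule:
                  result[target_field] = rule["function"](row[rule["field"]])
              elif "mapping" in rule:
                  result[target_field] = rule["mapping"][row[rule["field"]]]
          return result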
  40. OLAP with Cubes [Diagram: BI stack as before]
  41. Model (cubes, dimensions, measures, levels, attributes, hierarchy):

      {
          "name": "My Model",
          "description": "...",
          "cubes": [...],
          "dimensions": [...]
      }
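
A concrete model for the "sales" cube used on the next slides might look like this (hypothetical sketch; the exact keys depend on the Cubes version):

      import json

      model = {
          "name": "My Model",
          "cubes": [
              {"name": "sales",
               "measures": ["amount"],
               "dimensions": ["date", "sector"]}
          ],
          "dimensions": [
              {"name": "sector"},
              {"name": "date", "levels": ["year", "month"]}
          ]
      }

      with open("model.json", "w") as f:
          json.dump(model, f, indent=4)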
  42. logical vs. physical ❄
  43. Application workflow (cubes Aggregation Browser with a backend):

      model = load_model("model.json")                                         # 1
      workspace = create_workspace("sql", model, url="sqlite:///data.sqlite")  # 2
      cube = model.cube("sales")                                               # 3
      browser = workspace.browser(cube)                                        # 4
  44. drill-down:

      result = browser.aggregate(cell, drilldown=["sector"])
  45. for row in result.table_rows("sector"):
          row.record["amount_sum"]   # aggregated measure
          row.label                  # display label
          row.key                    # dimension key
  46. whole cube:

      cell = Cell(cube)
      browser.aggregate(cell)                        # Total

      browser.aggregate(cell, drilldown=["date"])    # 2006 2007 2008 2009 2010

      cut = PointCut("date", [2010])
      cell = cell.slice(cut)
      browser.aggregate(cell, drilldown=["date"])    # Jan Feb Mar Apr ...
  47. How can Python be Useful
  48. just the Language ■ saves maintenance resources ■ shortens development time ■ saves you from going insane
  49. [Diagram: DW architecture as before, annotated "faster"]
  50. [Diagram: BI stack as before, annotated "faster", "advanced", "understandable, maintainable"]
  51. Conclusion
  52. BI is about... people ■ technology ■ processes
  53. don’t forget metadata
  54. Future: who is going to fix your COBOL Java tool if you have only Python guys around?
  55. Python is capable, let’s start
  56. Thank You ■ Twitter: @Stiivi ■ DataBrewery blog: blog.databrewery.org ■ Github: github.com/Stiivi
