Python for Business Intelligence
Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
Python business intelligence (PyData 2012 talk)

What is the state of business intelligence tools in Python in 2012? How is Python used for data processing and analysis? The talk compares the different approaches taken for business data and for scientific data.

Video: https://vimeo.com/53063944

  1. Python for Business Intelligence
     Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
  2. Search: python business intelligence
  3. Results: Q/A and articles with Java solution references (not listed here)
  4. Why?
  5. Overview
     ■ Traditional Data Warehouse
     ■ Python and Data
     ■ Is Python Capable?
     ■ Conclusion
  6. Business Intelligence
  7. people, technology, processes
  8. Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
  9. Traditional Data Warehouse
  10. ■ Extracting data from the original sources
      ■ Quality assuring and cleaning data
      ■ Conforming the labels and measures in the data to achieve consistency across the original sources
      ■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards
      Source: Ralph Kimball, The Data Warehouse ETL Toolkit
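
The four steps listed on this slide can be sketched as a small pipeline. This is only an illustration using the Python standard library: the sample data, the REGION_LABELS mapping and the function names are invented for the example and do not come from the talk.

```python
import csv
import io

# Hypothetical raw export from a source system (invented sample data).
RAW = """region,amount
 north ,100
South,
NORTH,250
"""

# Hypothetical mapping that conforms source labels to one shared vocabulary.
REGION_LABELS = {"north": "North", "south": "South"}


def extract(text):
    """Extract: read rows from the original source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Quality-assure and clean: trim whitespace, drop rows with no amount."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["amount"]:
            cleaned.append(row)
    return cleaned


def conform(rows):
    """Conform labels and measures so they are consistent across sources."""
    return [
        {"region": REGION_LABELS[row["region"].lower()],
         "amount": int(row["amount"])}
        for row in rows
    ]


def deliver(rows):
    """Deliver: emit a format that query and reporting tools can read."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["region", "amount"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


delivered = deliver(conform(clean(extract(RAW))))
```

Keeping each step a separate function makes it possible to inspect or replace one stage, for example the cleaning rules, without touching the others.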
  11. Source Systems | Staging Area | Operational Data Store | Datamarts
      Source systems: structured documents, databases, APIs
      Temporary Staging Area
      Schemas: staging, relational, dimensional
      Layers: L0, L1, L2
  12. real time = daily
  13. Multi-dimensional Modeling
  14. aggregation browsing, slicing and dicing
  15. business / analyst’s point of view, regardless of physical schema implementation
  16. Facts: measurable, most detailed information (diagram labels: fact, fact data cell)
  17. Dimensions: location, type, time
  18. Dimension
      ■ provide context for facts
      ■ used to filter queries or reports
      ■ control scope of aggregation of facts
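
A minimal sketch of how facts and dimensions relate, assuming facts are plain dictionaries that hold one measure ("amount") and the three dimensions named on the slides (location, type, time, with time reduced to a year). The data and the aggregate helper are invented for illustration.

```python
from collections import defaultdict

# Invented sample facts: one measure ("amount") plus dimension values.
facts = [
    {"location": "NYC", "type": "retail", "year": 2011, "amount": 10},
    {"location": "NYC", "type": "online", "year": 2012, "amount": 20},
    {"location": "Boston", "type": "retail", "year": 2012, "amount": 5},
]


def aggregate(facts, measure, by=None, where=None):
    """Sum a measure; `where` filters on dimensions, `by` sets the scope."""
    where = where or {}
    totals = defaultdict(int)
    for fact in facts:
        # Dimensions used as a filter: keep only matching facts.
        if all(fact[dim] == value for dim, value in where.items()):
            # Dimension used as aggregation scope: group by its value.
            key = fact[by] if by else "total"
            totals[key] += fact[measure]
    return dict(totals)


total = aggregate(facts, "amount")
by_location = aggregate(facts, "amount", by="location")
nyc_2012 = aggregate(facts, "amount", where={"location": "NYC", "year": 2012})
```

Here the same dimension values serve both roles from the slide: as a filter (where) and as the scope of aggregation (by).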
  19. Pentaho
  20. Python and Data: community perception* (*as of Oct 2012)
  21. Scientific & Financial
  22. Python
  23. Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
  24. Scientific Data

             T1[s]    T2[s]    T3[s]    T4[s]
      P1    112,68   941,67   171,01   660,48
      P2     96,15   306,51   725,88   877,82
      P3    313,39   189,31    41,81   428,68
      P4    760,62   983,48   371,21   281,19
      P5    838,56    39,27   389,42   231,12

      n-dimensional array of numbers
  25. Assumptions
      ■ data is mostly numbers
      ■ data is neatly organized...
      ■ … in one multi-dimensional array
  26. Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
  27. Business Data
  28. ■ multiple snapshots of one source
      ■ multiple representations of same data
      ■ categories are changing
  29. ❄
  30. Is Python Capable? (very basic examples)
  31. Data Pipes with SQLAlchemy
      Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
  32. ■ connection: create_engine
      ■ schema reflection: MetaData, Table
      ■ expressions: select(), insert()
  33. # 2012-era SQLAlchemy API: bound MetaData and autoload=True
      # were removed in SQLAlchemy 2.0.
      from sqlalchemy import create_engine, MetaData, Table

      src_engine = create_engine("sqlite:///data.sqlite")
      src_metadata = MetaData(bind=src_engine)
      src_table = Table("data", src_metadata, autoload=True)

      target_engine = create_engine("postgresql://localhost/sandbox")
      target_metadata = MetaData(bind=target_engine)
      target_table = Table("data", target_metadata)
  34. clone schema:

          for column in src_table.columns:
              target_table.append_column(column.copy())
          target_table.create()

      copy data:

          insert = target_table.insert()
          for row in src_table.select().execute():
              insert.execute(row)
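
The same clone-schema-then-copy-rows idea can be tried without any third-party package. The sketch below swaps SQLAlchemy for the standard-library sqlite3 module and uses SQLite's PRAGMA table_info as a stand-in for metadata reflection; the two in-memory databases and the sample rows are invented for the example.

```python
import sqlite3

# Two in-memory databases stand in for the source and target engines.
src = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Invented sample table; the slides call the table "data".
src.execute("CREATE TABLE data (id INTEGER, name TEXT, amount REAL)")
src.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(1, "a", 1.5), (2, "b", 2.5)])

# "Reflection": ask the database for column names and declared types.
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in src.execute("PRAGMA table_info(data)")]

# Clone the schema in the target database.
column_defs = ", ".join('"{}" {}'.format(name, ctype)
                        for name, ctype in columns)
target.execute("CREATE TABLE data ({})".format(column_defs))

# Copy the data row by row.
placeholders = ", ".join("?" for _ in columns)
insert = "INSERT INTO data VALUES ({})".format(placeholders)
for row in src.execute("SELECT * FROM data"):
    target.execute(insert, row)
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
```

Unlike the SQLAlchemy version, this only works between SQLite databases, because the declared column types are copied verbatim rather than translated per dialect.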
  35. magic used: metadata reflection
  36. text file (CSV) to table:

          import csv
          from sqlalchemy import Column, String

          reader = csv.reader(file_stream)
          columns = next(reader)  # first row holds the column names
          for column in columns:
              table.append_column(Column(column, String))
          table.create()
          for row in reader:
              insert.execute(row)
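
A self-contained variant of the CSV loading step, again assuming the standard-library sqlite3 module in place of SQLAlchemy. The CSV content and the table name "imported" are made up for the example.

```python
import csv
import io
import sqlite3

# Invented CSV content; file_stream on the slide would be an open file.
file_stream = io.StringIO("name,city\nAda,London\nGrace,New York\n")

reader = csv.reader(file_stream)
columns = next(reader)  # the first row holds the column names

connection = sqlite3.connect(":memory:")
# Every column is created as text, mirroring Column(column, String).
column_defs = ", ".join('"{}" TEXT'.format(column) for column in columns)
connection.execute("CREATE TABLE imported ({})".format(column_defs))

# The reader yields the remaining rows, which are inserted as-is.
placeholders = ", ".join("?" for _ in columns)
connection.executemany(
    "INSERT INTO imported VALUES ({})".format(placeholders), reader)
connection.commit()

rows = connection.execute(
    "SELECT name, city FROM imported ORDER BY name").fetchall()
```

Loading everything as text keeps the import step trivial; type conversion is then left to a later transformation step.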
  37. Simple T from ETL
      Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
  38. # Each entry: (target field, source field and transformation)
      transformation = [
          ("fiscal_year", {"function": int, "field": "fiscal_year"}),
          ("region_code", {"mapping": region_map, "field": "region"}),
          ("borrower_country", None),
          ("project_name", None),
          ("procurement_type", None),
          ("major_sector_code", {"mapping": sector_code_map,
                                 "field": "major_sector"}),
          ("major_sector", None),
          ("supplier", None),
          ("contract_amount", {"function": currency_to_number,
                               "field": "total_contract_amount"}),
      ]
  39. Transformation

          for row in source:
              result = transform(row, transformation)
              table.insert(result).execute()
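
The slides do not show transform() itself. One possible implementation is sketched below under assumed semantics: a specification of None copies the source field of the same name, "field" names the source field, "mapping" looks the value up in a dictionary, and "function" converts the value. The helper currency_to_number, the mapping values and the sample row are likewise invented.

```python
def transform(row, transformation):
    """Build a target record from a source row (hypothetical implementation)."""
    result = {}
    for target_field, spec in transformation:
        spec = spec or {}  # None means: copy the field unchanged (assumed)
        value = row[spec.get("field", target_field)]
        if "mapping" in spec:
            value = spec["mapping"][value]
        if "function" in spec:
            value = spec["function"](value)
        result[target_field] = value
    return result


def currency_to_number(text):
    """Assumed helper: turn a string such as "$1,200.50" into a float."""
    return float(text.replace("$", "").replace(",", ""))


region_map = {"Africa": "AFR"}  # invented mapping values

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

row = {"fiscal_year": "2012", "region": "Africa", "supplier": "ACME",
       "total_contract_amount": "$1,200.50"}
result = transform(row, transformation)
```

Describing the transformation as data rather than code means the list can be stored, reviewed or generated as metadata, separate from the loop that applies it.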
  40. OLAP with Cubes
      Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
  41. Model

          {
              "name": "My Model",
              "description": ....,
              "cubes": [...],
              "dimensions": [...]
          }

      cubes: measures
      dimensions: levels, attributes, hierarchy
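
To make the shape of such a model more concrete, here is a filled-in example written as a Python dictionary and serialized to JSON. Only the four top-level keys come from the slide. The nested key names follow the slide's labels (measures, levels, attributes, hierarchy) but are assumptions and not necessarily the schema that Cubes expects; all values are invented.

```python
import json

# Illustrative only: nested key names are assumptions based on the slide's
# labels, not the exact schema required by Cubes.
model = {
    "name": "My Model",
    "description": "Hypothetical example model",
    "cubes": [
        {"name": "sales",
         "measures": ["amount"],
         "dimensions": ["date", "sector"]},
    ],
    "dimensions": [
        {"name": "date",
         "levels": ["year", "month"],
         "hierarchy": ["year", "month"]},
        {"name": "sector",
         "levels": ["sector"],
         "attributes": ["code", "label"]},
    ],
}

# Serialize to the JSON form a model file would hold, then read it back.
text = json.dumps(model, indent=4)
restored = json.loads(text)
```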
  42. logical | physical ❄
  43. Steps:

          1  load_model("model.json")
          2  create_workspace("sql", model, url="sqlite:///data.sqlite")
          3  model.cube("sales")
          4  workspace.browser(cube)

      Diagram labels: Application, ∑ cubes, Aggregation Browser, backend
  44. drill-down:

          browser.aggregate(cell, drilldown=["sector"])
  45. for row in result.table_rows("sector"):
          row.record["amount_sum"]
          row.label
          row.key
  46. whole cube:

          cell = Cell(cube)
          browser.aggregate(cell)                        # slide shows: Total
          browser.aggregate(cell, drilldown=["date"])   # slide shows: 2006 2007 2008 2009 2010

          cut = PointCut("date", [2010])
          cell = cell.slice(cut)
          browser.aggregate(cell, drilldown=["date"])   # slide shows: Jan Feb Mar Apr March April May ...
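
The cell, cut and drill-down concepts can be imitated in a few lines of plain Python. The classes below borrow the names Cell and PointCut from the slides but are a toy re-implementation for illustration, not the Cubes API: aggregate() is a free function over an in-memory list, drilldown takes a single dimension name, and the facts are invented.

```python
from collections import OrderedDict


class PointCut(object):
    """A cut fixing one dimension to a path, e.g. ("date", [2010])."""
    def __init__(self, dimension, path):
        self.dimension = dimension
        self.path = list(path)


class Cell(object):
    """Part of the cube selected by cuts; no cuts means the whole cube."""
    def __init__(self, cuts=()):
        self.cuts = list(cuts)

    def slice(self, cut):
        # Return a new, narrower cell; the original cell is unchanged.
        return Cell(self.cuts + [cut])

    def contains(self, fact):
        # A fact is inside the cell if its dimension path starts with
        # the path of every cut.
        return all(fact[cut.dimension][:len(cut.path)] == cut.path
                   for cut in self.cuts)


def aggregate(facts, cell, drilldown=None):
    """Sum "amount" in the cell, one level below any cut on `drilldown`."""
    depth = 0
    for cut in cell.cuts:
        if cut.dimension == drilldown:
            depth = len(cut.path)
    totals = OrderedDict()
    for fact in facts:
        if cell.contains(fact):
            key = fact[drilldown][depth] if drilldown else "Total"
            totals[key] = totals.get(key, 0) + fact["amount"]
    return totals


# Invented facts; "date" is stored as a [year, month] path.
facts = [
    {"date": [2009, "Dec"], "amount": 5},
    {"date": [2010, "Jan"], "amount": 10},
    {"date": [2010, "Feb"], "amount": 20},
]

whole = Cell()
total = aggregate(facts, whole)
by_year = aggregate(facts, whole, drilldown="date")
in_2010 = whole.slice(PointCut("date", [2010]))
by_month = aggregate(facts, in_2010, drilldown="date")
```

Drilling down on "date" from the whole cube groups by year; after slicing to 2010 the same call groups by month, which mirrors the sequence on the slide.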
  47. How can Python be Useful
  48. just the Language
      ■ saves maintenance resources
      ■ shortens development time
      ■ saves you from going insane
  49. Source Systems | Staging Area | Operational Data Store | Datamarts
      Source systems: structured documents, databases, APIs
      Temporary Staging Area
      Schemas: staging, relational, dimensional
      Layers: L0, L1, L2
      Annotation: faster
  50. Data Sources | Extraction, Transformation, Loading | Analysis and Presentation | Data Governance | Technologies and Utilities
      Annotations: faster, advanced; understandable, maintainable
  51. Conclusion
  52. BI is about… people, technology, processes
  53. don’t forget metadata
  54. Future: who is going to fix your COBOL Java tool if you have only Python guys around?
  55. Python is capable, let’s start
  56. Thank You
      Twitter: @Stiivi
      DataBrewery blog: blog.databrewery.org
      Github: github.com/Stiivi
