Python business intelligence (PyData 2012 talk)

Python for Business Intelligence

Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012

python business intelligence

Results

Q/A and articles with Java solution references (not listed here)

Overview

■ Traditional Data Warehouse
■ Python and Data
■ Is Python Capable?
■ Conclusion

[Diagram: Sources | Extraction, Transformation, Loading | Data Analysis and Presentation; Data Governance; Technologies and Utilities]

■ Extracting data from the original sources
■ Quality assuring and cleaning data
■ Conforming the labels and measures in the data to achieve consistency across the original sources
■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

Source: Ralph Kimball – The Data Warehouse ETL Toolkit
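
The four steps above can be sketched in plain Python. This is a minimal illustration and not code from the talk: the sample records, the `REGION_LABELS` mapping and all function names are hypothetical.

```python
import csv
import io

# Hypothetical raw records, as they might arrive from two source systems.
RAW = [
    {"region": "EU ", "amount": "1,200.50"},
    {"region": "europe", "amount": "300"},
    {"region": "", "amount": "n/a"},  # unusable record
]

# Hypothetical mapping that conforms differing source labels to one label.
REGION_LABELS = {"eu": "Europe", "europe": "Europe"}


def extract(records):
    """Extract: read rows from the original source (here an in-memory list)."""
    for record in records:
        yield dict(record)


def clean(rows):
    """Quality-assure and clean: trim text, parse numbers, drop bad rows."""
    for row in rows:
        region = row["region"].strip().lower()
        try:
            amount = float(row["amount"].replace(",", ""))
        except ValueError:
            continue
        if region:
            yield {"region": region, "amount": amount}


def conform(rows):
    """Conform: map source-specific labels to consistent ones."""
    for row in rows:
        yield {"region": REGION_LABELS.get(row["region"], row["region"]),
               "amount": row["amount"]}


def deliver(rows):
    """Deliver: write a physical format (CSV) that reporting tools can read."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["region", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


delivered = deliver(conform(clean(extract(RAW))))
```

Each step is a generator, so rows stream through the pipeline one at a time and a step can be replaced without touching the others.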

[Diagram: Source Systems (structured documents, databases, APIs), Temporary Staging Area, then three layers: Staging Area (L0, staging), Operational Data Store (L1, relational), Datamarts (L2, dimensional)]

Multi-dimensional Modeling

aggregation browsing
slicing and dicing

business / analyst’s point of view

regardless of physical schema implementation

Facts

[Diagram: a fact (measurable, most detailed information) shown as a fact data cell positioned along the dimensions location, type and time]

Dimension

■ provide context for facts
■ used to filter queries or reports
■ control scope of aggregation of facts
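
The roles of dimensions listed above can be illustrated with a small sketch. Everything in it is hypothetical (the `FACTS` rows, the `amount` measure, the `aggregate` helper); it only shows a dimension being used once as a filter and once as the scope of aggregation.

```python
from collections import defaultdict

# Hypothetical facts: each has one measure (amount) and three dimensions.
FACTS = [
    {"location": "Bratislava", "type": "goods", "year": 2010, "amount": 100.0},
    {"location": "Bratislava", "type": "services", "year": 2010, "amount": 50.0},
    {"location": "New York", "type": "goods", "year": 2011, "amount": 70.0},
]


def aggregate(facts, by=None, where=None):
    """Sum the amount measure, optionally filtered and grouped by a dimension."""
    where = where or {}
    totals = defaultdict(float)
    for fact in facts:
        # Dimensions filter the facts that take part in the query.
        if any(fact[dim] != value for dim, value in where.items()):
            continue
        # Dimensions also control the scope of aggregation: one total per key.
        key = fact[by] if by else "total"
        totals[key] += fact["amount"]
    return dict(totals)


by_location = aggregate(FACTS, by="location")
goods_only = aggregate(FACTS, where={"type": "goods"})
```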

Python and Data
community perception*

*as of Oct 2012

Scientific Data

      T1[s]    T2[s]    T3[s]    T4[s]
P1   112,68   941,67   171,01   660,48
P2    96,15   306,51   725,88   877,82
P3   313,39   189,31    41,81   428,68
P4   760,62   983,48   371,21   281,19
P5   838,56    39,27   389,42   231,12

n-dimensional array of numbers

Assumptions

■ data is mostly numbers
■ data is neatly organized...
■ … in one multi-dimensional array

multiple snapshots of one source
multiple representations of same data
categories are changing

Is Python Capable?
very basic examples

Data Pipes with SQLAlchemy

■ connection: create_engine
■ schema reflection: MetaData, Table
■ expressions: select(), insert()

from sqlalchemy import create_engine, MetaData, Table

# bound MetaData and autoload=True follow the SQLAlchemy API current in 2012
src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)

target_engine = create_engine("postgresql://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)

clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()

copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)

magic used: metadata reflection
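
The same clone-and-copy idea can also be sketched with the standard library's `sqlite3` module. This is a swapped-in technique, not the talk's code: it reads column names and types from `PRAGMA table_info` in place of SQLAlchemy's metadata reflection, and the in-memory databases, the table contents and the variable names are assumptions.

```python
import sqlite3

# Hypothetical source database with one small table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE data (id INTEGER, name TEXT, amount REAL)")
src.executemany("INSERT INTO data VALUES (?, ?, ?)",
                [(1, "a", 10.0), (2, "b", 20.5)])

target = sqlite3.connect(":memory:")

# "Reflect" the schema: PRAGMA table_info returns one row per column as
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in src.execute("PRAGMA table_info(data)")]

# Clone schema: rebuild CREATE TABLE from the reflected names and types.
column_sql = ", ".join('"{}" {}'.format(name, ctype) for name, ctype in columns)
target.execute("CREATE TABLE data ({})".format(column_sql))

# Copy data: stream rows from the source cursor into the target table.
placeholders = ", ".join("?" for _ in columns)
target.executemany("INSERT INTO data VALUES ({})".format(placeholders),
                   src.execute("SELECT * FROM data"))
target.commit()

copied = target.execute("SELECT * FROM data ORDER BY id").fetchall()
```

Unlike the SQLAlchemy version, this sketch only works between SQLite databases, because the reflected type names are reused verbatim in the target.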

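text file (CSV) to table:

import csv
from sqlalchemy import Column, String

reader = csv.reader(file_stream)

# the first row holds the column names
columns = next(reader)

for column in columns:
    table.append_column(Column(column, String))

table.create()
insert = table.insert()

for row in reader:
    # pair each value with its column name
    insert.execute(dict(zip(columns, row)))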
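
A self-contained variant of the CSV load, again using the standard library's `sqlite3` rather than SQLAlchemy, might look as follows. The CSV content, the table name `imported` and the in-memory database are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be an open file.
file_stream = io.StringIO("name,region,amount\nAlpha,EU,10\nBeta,US,20\n")

reader = csv.reader(file_stream)
columns = next(reader)  # the first row holds the column names

conn = sqlite3.connect(":memory:")

# Every column is created as text, mirroring Column(column, String).
column_sql = ", ".join('"{}" TEXT'.format(name) for name in columns)
conn.execute("CREATE TABLE imported ({})".format(column_sql))

# The remaining rows are inserted as they are read.
placeholders = ", ".join("?" for _ in columns)
conn.executemany("INSERT INTO imported VALUES ({})".format(placeholders), reader)
conn.commit()

rows = conn.execute("SELECT name, amount FROM imported ORDER BY name").fetchall()
```

All values arrive as strings, which is why a separate transformation step is still needed before analysis.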

Simple T from ETL

# each entry: (target field, source transformation)
transformation = [
    ('fiscal_year', {"function": int,
                     "field": "fiscal_year"}),
    ('region_code', {"mapping": region_map,
                     "field": "region"}),
    ('borrower_country', None),
    ('project_name', None),
    ('procurement_type', None),
    ('major_sector_code', {"mapping": sector_code_map,
                           "field": "major_sector"}),
    ('major_sector', None),
    ('supplier', None),
    ('contract_amount', {"function": currency_to_number,
                         "field": 'total_contract_amount'}),
]

Transformation

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()
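
The slides call `transform()` without showing it. One possible implementation follows, purely as an assumption about how the specification could be interpreted: `None` copies the source field of the same name, a `function` entry applies a callable to the source `field`, and a `mapping` entry looks the source value up in a dictionary. The `region_map`, `currency_to_number` and sample row below are hypothetical.

```python
def transform(row, transformation):
    """Build a target record from a source row and a transformation list."""
    result = {}
    for target_field, spec in transformation:
        if spec is None:
            # No transformation: copy the field of the same name.
            result[target_field] = row[target_field]
            continue
        value = row[spec["field"]]
        if "function" in spec:
            value = spec["function"](value)
        elif "mapping" in spec:
            value = spec["mapping"][value]
        result[target_field] = value
    return result


def currency_to_number(text):
    """Hypothetical helper: turn a string such as '$1,250.00' into a float."""
    return float(text.replace("$", "").replace(",", ""))


region_map = {"Europe and Central Asia": "ECA"}  # hypothetical mapping

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]

row = {"fiscal_year": "2012", "region": "Europe and Central Asia",
       "supplier": "ACME", "total_contract_amount": "$1,250.00"}

result = transform(row, transformation)
```

Keeping the specification as plain data means it can be reviewed, stored or generated separately from the code that applies it.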

OLAP with Cubes

Model
{
    "name": "My Model",
    "description": ....,

    "cubes": [...],
    "dimensions": [...]
}

cubes: measures
dimensions: levels, attributes, hierarchy

logical

physical ❄

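[Diagram: Application, cubes (∑), Aggregation Browser, backend]

from cubes import load_model, create_workspace

model = load_model("model.json")                      # step 1
workspace = create_workspace("sql", model,            # step 2
                             url="sqlite:///data.sqlite")
cube = model.cube("sales")                            # step 3
browser = workspace.browser(cube)                     # step 4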

result = browser.aggregate(cell, drilldown=["sector"])

drill-down

for row in result.table_rows("sector"):
    row.record["amount_sum"]
    row.label
    row.key

whole cube

cell = Cell(cube)
browser.aggregate(cell)
Total

browser.aggregate(cell, drilldown=["date"])
2006 2007 2008 2009 2010

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell, drilldown=["date"])
Jan Feb Mar Apr May ...
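
To make the cell, slice and drill-down steps concrete without a Cubes installation, here is a toy stand-in in plain Python. It is not the Cubes implementation or its API: the `Cell` class, the `HIERARCHIES` table, the `aggregate` function and the fact rows are all hypothetical, and the sketch only mimics the behaviour shown on the slides (drilling down by date gives years for the whole cube and months after slicing at 2010).

```python
from collections import defaultdict

# Hypothetical fact rows; "year" and "month" are levels of the date dimension.
FACTS = [
    {"year": 2009, "month": 12, "sector": "health", "amount": 5.0},
    {"year": 2010, "month": 1, "sector": "health", "amount": 10.0},
    {"year": 2010, "month": 1, "sector": "energy", "amount": 2.5},
    {"year": 2010, "month": 2, "sector": "energy", "amount": 4.0},
]

# Hypothetical hierarchies: the ordered levels of each dimension.
HIERARCHIES = {"date": ["year", "month"], "sector": ["sector"]}


class Cell:
    """Toy cell: a set of point cuts, each a path into a dimension hierarchy."""

    def __init__(self, cuts=None):
        self.cuts = dict(cuts or {})

    def slice(self, dimension, path):
        """Return a new cell narrowed by a point cut on one dimension."""
        cuts = dict(self.cuts)
        cuts[dimension] = list(path)
        return Cell(cuts)


def aggregate(facts, cell, drilldown=None):
    """Sum amounts inside the cell, one total per next-level drilldown key."""
    totals = defaultdict(float)
    for fact in facts:
        # Keep only facts whose level values match every cut path.
        inside = all(
            [fact[level] for level in HIERARCHIES[dim][:len(path)]] == path
            for dim, path in cell.cuts.items())
        if not inside:
            continue
        if drilldown is None:
            key = "total"
        else:
            # Drill down to the level just below the current cut depth.
            depth = len(cell.cuts.get(drilldown, []))
            key = fact[HIERARCHIES[drilldown][depth]]
        totals[key] += fact["amount"]
    return dict(totals)


cell = Cell()                                        # whole cube
total = aggregate(FACTS, cell)
by_year = aggregate(FACTS, cell, drilldown="date")
cell_2010 = cell.slice("date", [2010])               # point cut at 2010
by_month = aggregate(FACTS, cell_2010, drilldown="date")
```

The sketch does no bounds checking: drilling down past the last level of a hierarchy raises an `IndexError`.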

just the Language
■ saves maintenance resources
■ shortens development time
■ saves you from going insane

[Diagram: the data warehouse layers (Source Systems, Temporary Staging Area, Staging Area L0, Operational Data Store L1, Datamarts L2) and the ETL / analysis / governance overview shown again, annotated "faster", "faster", "advanced" and "understandable, maintainable"]

BI is about…

people, technology, processes

Future

who is going to fix your COBOL Java tool if you have only Python guys around?

Thank You

Twitter: @Stiivi
DataBrewery blog: blog.databrewery.org
Github: github.com/Stiivi
