Prescriptive analytics with DOcplex and pandas

© 2016 IBM Corporation
Prescriptive Analytics with
DOcplex and pandas
Hugues JUILLE

Agenda
• What is Prescriptive Analytics?
• Why Python for Prescriptive Analytics?
• DOcplex: What is it?
• Using DOcplex for modelling an Optimization problem
• Using pandas for improved modelling capabilities

What is Prescriptive Analytics?
• Also known as: Decision Optimization
• Prescriptive analytics is about:
  • recommending actions,
  • based on desired outcomes,
  • taking into account:
    • specific scenarios,
    • limited resources, and
    • knowledge of past and current events.
• This insight can help organizations make better decisions and have greater control of business outcomes.
[Figure: chart with axes "Difficulty" and "Value". Descriptive Analytics: What happened? Predictive Analytics: What will happen? Prescriptive Analytics: How can we make it happen?]

The Science of Better Decisions
• What to build, where and when?
• How to best allocate aircraft and crews?
• Risk vs. potential reward
• Inventory cost vs. customer satisfaction
• Cost vs. carbon emissions
Optimization helps businesses:
• create the best possible plans
• explore alternatives and understand trade-offs
• respond to changes in business operations

How does Optimization work?

What is an optimization model?
An optimization model is composed of:
• Decision variables
• Constraints
• An objective function
Solving a model means finding an assignment of values to the decision variables that:
• minimizes or maximizes the objective function,
• while satisfying all constraints.
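To make the three ingredients concrete, here is a minimal sketch in plain Python. The two-product data (profits, machine hours, capacity, bounds) is invented for illustration, and brute-force enumeration stands in for a real solver such as CPLEX, which relies on far more efficient algorithms.

```python
from itertools import product

# Hypothetical data: profit and machine hours per unit for two products.
PROFIT = {"chairs": 30, "tables": 50}
HOURS = {"chairs": 2, "tables": 4}
CAPACITY_HOURS = 20   # limited resource
MAX_UNITS = 6         # upper bound on each decision variable


def objective(plan):
    """Objective function: total profit of a production plan."""
    return sum(PROFIT[p] * qty for p, qty in plan.items())


def is_feasible(plan):
    """Constraint: total machine hours must not exceed the capacity."""
    return sum(HOURS[p] * qty for p, qty in plan.items()) <= CAPACITY_HOURS


def solve():
    """Enumerate every assignment of the decision variables, keep the best feasible one."""
    names = list(PROFIT)
    best_plan, best_value = None, float("-inf")
    for quantities in product(range(MAX_UNITS + 1), repeat=len(names)):
        plan = dict(zip(names, quantities))
        if is_feasible(plan) and objective(plan) > best_value:
            best_plan, best_value = plan, objective(plan)
    return best_plan, best_value


best_plan, best_value = solve()
print(best_plan, best_value)
```

Enumeration only works for toy sizes: the number of candidate assignments grows exponentially with the number of variables, which is why dedicated solvers are used in practice.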
A Constraint Programming (CP) model:
• Is based on higher-level constructs:
  • Discrete or interval variables
  • A rich set of logical, arithmetic, or (non-linear) functional constraints over variables
• Is dedicated to combinatorial and scheduling problems
A Mathematical Programming
model:

Modelling languages for Prescriptive Analytics
• Modelling languages for Prescriptive Analytics: AMPL, GAMS, OPL…
• They enable concise formulations close to mathematical notation, with intensive use of matrix representations…
Structure of the example model:
• Input data definition
• Decision variables: how much to produce for each product
• Objective: maximize profit
• Constraints: demand for components cannot exceed stock
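\[
\begin{aligned}
\text{maximize} \quad & \sum_{p} \mathit{Profit}_p \times \mathit{Production}_p \\
\text{subject to} \quad & \sum_{p} \mathit{Demand}_{p,c} \times \mathit{Production}_p \le \mathit{Stock}_c \quad \forall c \\
& \mathit{Production}_p \ge 0 \quad \forall p
\end{aligned}
\]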
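The same formulation maps almost term by term onto Python generator expressions. The sketch below only evaluates a candidate production plan against the objective and the stock constraints; it does not optimize anything. The product names, component names, and all numbers are invented for illustration.

```python
# Hypothetical input data: two products built from two components.
products = ["desk", "shelf"]
components = ["wood", "screws"]
profit = {"desk": 40, "shelf": 25}
demand = {("desk", "wood"): 4, ("desk", "screws"): 8,
          ("shelf", "wood"): 3, ("shelf", "screws"): 4}
stock = {"wood": 60, "screws": 100}


def total_profit(production):
    """Objective: sum over p of Profit[p] * Production[p]."""
    return sum(profit[p] * production[p] for p in products)


def violated_components(production):
    """Components c for which sum over p of Demand[p, c] * Production[p] exceeds Stock[c]."""
    return [c for c in components
            if sum(demand[p, c] * production[p] for p in products) > stock[c]]


plan = {"desk": 9, "shelf": 7}
print(total_profit(plan), violated_components(plan))
```

With a modelling library, `production[p]` would be a decision variable rather than a number, and the same sums would build expressions instead of values.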

Why Python for Prescriptive Analytics?
• Take advantage of Python's expressiveness (generators, aggregators, operator overloading, tuples…).
• Python's capabilities make it a viable alternative to specialized modelling languages:
  • A single language to create the constraints and run the workflow.
  • Standard libraries with abstract constructs to manipulate vectors, matrices, relational data models…
  • Ecosystem, ease of use, proven robustness, data ingestion.
  • Workflow and mathematical description are part of the language; no manual memory management.
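Operator overloading is what lets a constraint be written as an ordinary Python comparison. The toy classes below are a hypothetical illustration of the mechanism, not DOcplex's implementation: `LinExpr`, `Constraint`, and `var` are names made up for this sketch, and only addition, multiplication by a number, and `<=` are supported.

```python
from collections import namedtuple

# A constraint is recorded as data (coefficients, sense, right-hand side).
Constraint = namedtuple("Constraint", ["coefs", "sense", "rhs"])


class LinExpr:
    """Toy linear expression: variable name -> coefficient, plus a constant."""

    def __init__(self, coefs=None, constant=0):
        self.coefs = dict(coefs or {})
        self.constant = constant

    def __add__(self, other):
        if not isinstance(other, LinExpr):
            other = LinExpr(constant=other)
        coefs = dict(self.coefs)
        for name, coef in other.coefs.items():
            coefs[name] = coefs.get(name, 0) + coef
        return LinExpr(coefs, self.constant + other.constant)

    __radd__ = __add__  # lets the built-in sum() start from 0

    def __mul__(self, factor):
        # Only multiplication by a plain number is supported in this sketch.
        return LinExpr({n: c * factor for n, c in self.coefs.items()},
                       self.constant * factor)

    __rmul__ = __mul__

    def __le__(self, rhs):
        # Build a constraint object instead of returning True/False.
        return Constraint(self.coefs, "<=", rhs - self.constant)


def var(name):
    """Create an expression made of a single variable."""
    return LinExpr({name: 1})


x, y = var("x"), var("y")
weights = {"x": 2, "y": 3}
constraint = sum(weights[n] * v for n, v in (("x", x), ("y", y))) <= 10
print(constraint)
```

The generator inside `sum(...)` plays the role of the aggregate, and the comparison operator produces the constraint, which is the style the modelling code in this deck relies on.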

Why Python for Prescriptive Analytics?
Core Python libraries for scientific computing
Notebooks: a great technology for prototyping optimization models interactively
Leverage Big Data tools, such as Apache Spark.

DOcplex: What is it? How to get it?
• Easily formulate your optimization models and solve them with either the IBM Decision Optimization on Cloud solve service or a local CPLEX solver, with no code change.
• Free solve capabilities make it easy to discover this new API: a free cloud trial and the new free CPLEX Optimization Studio Community Edition (COS CE). An email address is all you need to access either one.
• Available through the standard Python pip install (pip install docplex); if you go fully cloud, there is nothing else to download and no need to contact anyone at IBM.
• Search for "docplex" in your browser to find the docplex PyPI page or documentation.

Comprehensive documentation and resources
All documentation and resources are available on-line
Educational: examples and cookbooks for all levels of expertise, from discovering IBM Decision Optimization technologies… to reference manuals for the APIs
Social: community / forums

DOcplex for optimization modelling (MP)
Import the DOcplex MP package
    from docplex.mp.model import Model
Create the container for your model
    mdl = Model('Warehouse')
Define decision variables (individually or as collections, discrete or continuous)
    x = mdl.continuous_var(name='totDmd')
    supply_vars = mdl.binary_var_matrix(warehouses, stores, 'supply')
Define constraints over variables
    mdl.add_constraint(supply_vars[w, s] <= open_vars[w])
    mdl.add_constraint(mdl.sum(supply_vars[w, s] for s in stores) <= w.capacity)
Define the objective
    mdl.minimize(total_opening_cost + total_supply_cost)
Solve the model using local CPLEX or on the cloud
    mdl.solve()
    mdl.solve(url=SVC_URL, key=SVC_KEY)

DOcplex and Notebooks for Optimization

Installing DOcplex and configuring your credentials

Easy to download and parse json
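The slide's screenshot is not available, so the following is only a minimal sketch of the parsing step using the standard `json` module. The payload and its field names are invented; in a notebook the text would typically come from an HTTP request rather than from a string literal.

```python
import json

# Hypothetical payload standing in for a downloaded JSON document.
raw = """
[
  {"name": "Bonn", "capacity": 5, "fixed_cost": 120},
  {"name": "Paris", "capacity": 7, "fixed_cost": 200}
]
"""

# json.loads turns the text into plain Python lists and dicts.
warehouses = json.loads(raw)

# Reshape the records into the lookups a model typically needs.
capacity = {w["name"]: w["capacity"] for w in warehouses}
total_fixed_cost = sum(w["fixed_cost"] for w in warehouses)
print(capacity, total_fixed_cost)
```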

Visualizing the input data

Slicing and Aggregate constructs
 Two important constructs to describe complex problems in a compact form:
 Slicing filters: select a subset of items in a multi-dimensional collection
 Aggregate:
• used in combination with slicing,
• build the actual mathematical expression
OPL:
    forall ( l in leg_ids, we in weeks )
        leg_teu[l][we] == sum (tv in trans_vars : tv.l.leg_id == l && w[tv.date] == we)
                              trans[tv] * size[tv.eqc];
DOcplex:
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(trans[(tv.leg_id, tv.mot, tv.date, tv.eqc)] * size[tv.eqc]
                                            for tv in trans_vars_list
                                            if tv.leg_id == l and w[tv.date] == we))

Performance considerations
• Runtime model generation should be as efficient as possible:
  • it may be invoked thousands of times when running in production
  • large models may involve millions of variables and constraints
  • a “naïve” translation of slicing/aggregate constructs into Python can be very inefficient when nested loops are involved
• Use pandas to handle slicing on large collections:
  “pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”
• DOcplex can benefit from the following pandas features:
  • Data organized in multi-indexed tables
  • Efficient merge operations between tables
  • Efficient indexing, filtering and grouping operations on tables

 Data Frame
trans_df:
    trans      eqc      leg_id   mot    date      week
    @trans_01  DRY-20   CDC-BOR  Truck  10/06/16  23
    @trans_02  HIGH-40  CHE-MAR  Train  10/06/16  23
    …          …        …        …      …         …
• “Naïve” slicing:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        for l in leg_ids:
            for we in weeks:
                mdl.add_constraint(
                    leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc] for t in trans_df_list
                                                if t.leg_id == l and w[t.date] == we))
    --> Elapsed time: 5875 ms
• Slicing with pandas:
    trans_df['week'] = trans_df.apply(lambda row: w[row.date], axis=1)
    for l in leg_ids:
        for we in weeks:
            slice_df = trans_df.loc[(trans_df.leg_id == l) & (trans_df.week == we)]
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc]
                                            for t in slice_df.itertuples()))

 Issue with this formulation:
 Slicing is calculated inside the nested loops
 cost of creating a pandas Data Frame is incurred at each iteration
 Much better strategy:
 Prepare the results of all slicing filters before entering the nested loops
 This can be done thanks to pandas’ groupby and aggregate operations
    for l in leg_ids:
        for we in weeks:
            slice_df = trans_df.loc[(trans_df.leg_id == l) & (trans_df.week == we)]
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc]
                                            for t in slice_df.itertuples()))
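The idea of preparing all slices before the nested loops does not depend on pandas. As a minimal plain-Python sketch, the code below compares per-pair filtering with a single bucketing pass. The data is invented, and `value` holds plain numbers standing in for decision variables.

```python
from collections import defaultdict, namedtuple

Trans = namedtuple("Trans", ["value", "eqc", "leg_id", "week"])

# Hypothetical data; `value` is a number standing in for a decision variable.
size = {"DRY-20": 1, "HIGH-40": 2}
trans_list = [
    Trans(3, "DRY-20", "CDC-BOR", 23),
    Trans(5, "HIGH-40", "CDC-BOR", 23),
    Trans(2, "DRY-20", "CHE-MAR", 23),
    Trans(4, "HIGH-40", "CHE-MAR", 24),
]
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]


def naive_totals():
    """Scan the whole list once per (leg, week) pair."""
    return {(l, we): sum(t.value * size[t.eqc] for t in trans_list
                         if t.leg_id == l and t.week == we)
            for l in leg_ids for we in weeks}


def pregrouped_totals():
    """Scan the list once, bucket the terms by (leg, week), then look buckets up."""
    buckets = defaultdict(list)
    for t in trans_list:
        buckets[(t.leg_id, t.week)].append(t.value * size[t.eqc])
    return {(l, we): sum(buckets.get((l, we), []))
            for l in leg_ids for we in weeks}


print(pregrouped_totals())
```

The naïve version does work proportional to legs × weeks × rows, while the bucketed version does work proportional to rows plus legs × weeks.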

 Prepare all results of slicing beforehand:
 Based on two pandas operations:
 groupby: split dataset into groups
 aggregate: perform a computation on the grouped data
    trans_df['result'] = trans_df.apply(lambda row: row.trans * size[row.eqc], axis=1)
    legWeeksMultiIndex = pd.MultiIndex.from_product([leg_ids, weeks], names=["leg_id", "week"])
    legWeeksMultiIndex_df = pd.DataFrame(legWeeksMultiIndex.values.tolist(),
                                         columns=["leg_id", "week"])
    trans_full_df = legWeeksMultiIndex_df.merge(trans_df, how='left').fillna(0)
    trans_sum_grpby = (trans_full_df[['leg_id', 'week', 'result']]
                       .groupby(['leg_id', 'week'])
                       .aggregate(lambda x: mdl.sum(x.tolist())))
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(leg_teu[(l, we)] == trans_sum_grpby.result[l, we])
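The groupby/aggregate pattern can be tried without a solver. The sketch below assumes pandas is installed and uses invented data, with plain numbers in the `trans` column standing in for decision variables, so the built-in group sum replaces `mdl.sum`.

```python
import pandas as pd

# Hypothetical data; `trans` holds numbers standing in for decision variables.
size = {"DRY-20": 1, "HIGH-40": 2}
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]
trans_df = pd.DataFrame({
    "trans": [3, 5, 2, 4],
    "eqc": ["DRY-20", "HIGH-40", "DRY-20", "HIGH-40"],
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR", "CHE-MAR"],
    "week": [23, 23, 23, 24],
})

# One term per row: value * container size.
trans_df["result"] = trans_df["trans"] * trans_df["eqc"].map(size)

# Every (leg_id, week) pair, so that empty slices still get a row.
pairs_df = pd.MultiIndex.from_product(
    [leg_ids, weeks], names=["leg_id", "week"]).to_frame(index=False)

# The left merge keeps pairs without data; their missing result becomes 0.
full_df = pairs_df.merge(trans_df[["leg_id", "week", "result"]],
                         how="left", on=["leg_id", "week"])
full_df["result"] = full_df["result"].fillna(0)

# Split by (leg_id, week) and aggregate each group in one call.
totals = full_df.groupby(["leg_id", "week"])["result"].sum()

for l in leg_ids:
    for we in weeks:
        print(l, we, totals.loc[(l, we)])
```

The nested loops remain, but each iteration is now a lookup in a precomputed result rather than a filter over the whole table.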

 Re-writing using helper methods for generic patterns:
 To be compared with initial “naïve” slicing formulation:
 Performance vs readability trade-off
    # Rewritten with a helper method:
    trans_df['result'] = trans_df.apply(lambda row: row.trans * size[row.eqc], axis=1)
    trans_sum_grpby = for_cross_prod_sum_by([leg_ids, weeks], trans_df,
                                            ['leg_id', 'week'], 'result')
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(leg_teu[(l, we)] == trans_sum_grpby.result[l, we])

    # Initial “naïve” slicing formulation, for comparison:
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc] for t in trans_df_list
                                            if t.leg_id == l and w[t.date] == we))
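The slides do not show how `for_cross_prod_sum_by` is implemented. The following is one possible sketch, assuming pandas is installed: it keeps the call shape used above and adds a hypothetical `sum_fn` parameter (defaulting to the built-in `sum`) so that a modelling library's own sum could be passed in. The data is invented and uses plain numbers.

```python
import pandas as pd


def for_cross_prod_sum_by(key_values, df, key_columns, value_column, sum_fn=sum):
    """Sum `value_column` of `df` for every combination of `key_values`.

    Hypothetical implementation. Returns a DataFrame indexed by `key_columns`;
    combinations absent from `df` get the sum of an empty list.
    """
    # One pass over the data: collect the terms of each existing group.
    groups = {key: grp[value_column].tolist()
              for key, grp in df.groupby(key_columns)}
    # Full cross product of the key values, including empty combinations.
    index = pd.MultiIndex.from_product(key_values, names=key_columns)
    sums = [sum_fn(groups.get(key, [])) for key in index]
    return pd.DataFrame({value_column: sums}, index=index)


# Hypothetical data; `result` holds numbers standing in for expressions.
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]
trans_df = pd.DataFrame({
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR", "CHE-MAR"],
    "week": [23, 23, 23, 24],
    "result": [3, 10, 2, 8],
})

trans_sum_grpby = for_cross_prod_sum_by([leg_ids, weeks], trans_df,
                                        ["leg_id", "week"], "result")
print(trans_sum_grpby)
```

Hiding the merge and groupby steps behind one function keeps the model code close to the naïve formulation in length while retaining the precomputed slices.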

Conclusion
• Python is one of the most relevant tools for quickly turning an idea into working code when dealing with data-wrangling problems, and then visualizing the results.
• The exact same code that was written and tested in a notebook for loading data, modelling an optimization problem, solving it… can readily be integrated and executed in a deployed Python environment.
• DOcplex objective: facilitate the diffusion and use of optimization technologies
• DOcplex + pandas: an alternative to specialized modelling languages
• On-going effort to define “best practices” and patterns to:
  • address performance issues
  • facilitate the formulation of models that are readable and maintainable

Thank you!
Questions/Answers

Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
