© 2016 IBM Corporation
Prescriptive Analytics with
DOcplex and pandas
Hugues JUILLE
Agenda
• What is Prescriptive Analytics?
• Why Python for Prescriptive Analytics?
• DOcplex: What is it?
• Using DOcplex for modelling an Optimization problem
• Using pandas for improved modelling capabilities
What is Prescriptive Analytics?
• Also known as: Decision Optimization
• Prescriptive analytics is about:
  • recommending actions,
  • based on desired outcomes,
  • taking into account:
    • specific scenarios,
    • limited resources, and
    • knowledge of past and current events.
• This insight can help organizations make better decisions and have greater control of business outcomes.
[Figure: analytics maturity chart with axes Difficulty and Value. Descriptive Analytics: What happened? Predictive Analytics: What will happen? Prescriptive Analytics: How can we make it happen?]
The Science of Better Decisions
• What to build, where and when?
• How to best allocate aircraft and crews?
• Risk vs. potential reward
• Inventory cost vs. customer satisfaction
• Cost vs. carbon emissions
Optimization helps businesses:
• create the best possible plans
• explore alternatives and understand trade-offs
• respond to changes in business operations
How does Optimization work?
What is an optimization model?
An optimization model is composed of:
• Decision variables
• Constraints
• An objective function
Solving a model means finding an assignment to the decision variables that:
• minimizes or maximizes the objective function,
• subject to meeting all constraints.
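As a sketch only, the definition above can be written in a generic form. The symbols x, f, g_i, m and X are notation assumed here for illustration; they do not appear on the slides.

```latex
\begin{aligned}
\min_{x \in X} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \qquad i = 1, \dots, m
\end{aligned}
```

Here x collects the decision variables, X is their domain, f is the objective function and each g_i encodes one constraint. A maximization problem has the same shape with f replaced by -f.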
A Constraint Programming (CP) model:
• Based on higher-level constructs:
  • Discrete or interval variables
  • Rich set of logical, arithmetic or (non-linear) functional constraints over variables
• Dedicated to combinatorial / scheduling problems
A Mathematical Programming model:
Modelling languages for Prescriptive Analytics
• Modelling languages for Prescriptive Analytics: AMPL, GAMS, OPL…
• They enable concise formulations close to mathematical language, with intensive use of matrix representations…
A model for a production problem has four parts:
• Input data definition
• Decision variables: how much to produce for each product
• Objective: maximize profit
• Constraints: demand for components cannot exceed stock
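$$\max \sum_{p} \mathit{Profit}_p \times \mathit{Production}_p$$
$$\text{subject to} \quad \forall c: \ \sum_{p} \mathit{Demand}_{p,c} \times \mathit{Production}_p \le \mathit{Stock}_c$$
$$\forall p: \ \mathit{Production}_p \ge 0$$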
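To make the formulation concrete, the following minimal sketch evaluates the objective and checks the constraints for one candidate production plan in plain Python. The products, components and all numbers are invented for illustration, and no solver is involved: the code only shows what each term of the model computes.

```python
# Hypothetical data, invented for illustration (not from the talk).
profit = {"bread": 3.0, "cake": 5.0}      # Profit[p]
stock = {"flour": 100.0, "sugar": 40.0}   # Stock[c]
demand = {                                # Demand[p, c]
    ("bread", "flour"): 2.0, ("bread", "sugar"): 0.0,
    ("cake", "flour"): 1.0, ("cake", "sugar"): 2.0,
}


def total_profit(production):
    """Objective: sum over p of Profit[p] * Production[p]."""
    return sum(profit[p] * production[p] for p in profit)


def is_feasible(production):
    """Check Production[p] >= 0 and, for every component c,
    sum over p of Demand[p, c] * Production[p] <= Stock[c]."""
    if any(quantity < 0 for quantity in production.values()):
        return False
    return all(
        sum(demand[p, c] * production[p] for p in profit) <= stock[c]
        for c in stock
    )


plan = {"bread": 40.0, "cake": 20.0}
print(is_feasible(plan), total_profit(plan))
```

An optimization engine searches over all such plans for the feasible one with the highest profit; the functions above only evaluate a plan that is already given.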
Why Python for Prescriptive Analytics?
• Take advantage of Python's expressiveness (generators, aggregators, operator overloading, tuples…).
• Python's capabilities make it a viable alternative to specialized modelling languages:
  • A single language to create the constraints AND drive the workflow.
  • Standard libraries with abstract constructs to manipulate vectors, matrices, relational data models…
  • Ecosystem, ease of use, proven robustness, data ingestion.
  • Workflow and mathematical description are part of the language; no memory management.
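The role of operator overloading and generators can be illustrated with a toy class. This is a sketch written for this purpose only: LinExpr and var are hypothetical names, and the code is not how DOcplex is implemented. It shows how a library can make ordinary Python operators build a constraint description instead of computing a number.

```python
class LinExpr:
    """Toy linear expression: a mapping from variable name to coefficient."""

    def __init__(self, coeffs):
        self.coeffs = dict(coeffs)

    def __add__(self, other):
        merged = dict(self.coeffs)
        for name, coef in other.coeffs.items():
            merged[name] = merged.get(name, 0) + coef
        return LinExpr(merged)

    def __radd__(self, other):
        # Lets the built-in sum() start from its default integer 0.
        return self if other == 0 else NotImplemented

    def __rmul__(self, factor):
        return LinExpr({n: factor * c for n, c in self.coeffs.items()})

    def __le__(self, bound):
        # The comparison builds a constraint description, not a bool.
        return ("<=", self.coeffs, bound)


def var(name):
    """Create an expression holding a single variable with coefficient 1."""
    return LinExpr({name: 1})


products = ["bread", "cake"]
flour_use = {"bread": 2, "cake": 1}

# A generator expression plus overloaded *, + and <= reads like the math.
constraint = sum(flour_use[p] * var(p) for p in products) <= 100
print(constraint)
```

The last statement has the same shape as a constraint written in a dedicated modelling language, yet it is plain Python.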
Why Python for Prescriptive Analytics?
• Core Python libraries for scientific computing
• Notebooks: a great technology for prototyping optimization models interactively
• Leverage Big Data tools, such as Apache Spark
DOcplex: What is it? How to get it?
• Easily formulate your optimization models and solve them with the IBM Decision Optimization on the Cloud solve service or with a local CPLEX solver, with no code change.
• Free solve capabilities make it easy to discover this new API: a free cloud trial and the new free Community Edition of CPLEX Optimization Studio (COS CE). Either one only requires an email address.
• Available through the standard Python installer (pip install docplex); if you go fully cloud, there is nothing else to download and no need to contact anyone at IBM.
• Search for "docplex" to find the docplex PyPI page and documentation.
Comprehensive documentation and resources
All documentation and resources are available online:
• Educational: examples and cookbooks for all levels of expertise, from discovering IBM Decision Optimization technologies to reference manuals for the APIs
• Social: community and forums
DOcplex for optimization modelling (MP)
• Import the DOcplex MP package:
    from docplex.mp.model import Model
• Create the container for your model:
    mdl = Model('Warehouse')
• Define decision variables (individually or as collections, discrete or continuous):
    x = mdl.continuous_var(name='totDmd')
    supply_vars = mdl.binary_var_matrix(warehouses, stores, 'supply')
• Define constraints over variables:
    mdl.add_constraint(supply_vars[w, s] <= open_vars[w])
    mdl.add_constraint(mdl.sum(supply_vars[w, s] for s in stores) <= w.capacity)
• Define the objective:
    mdl.minimize(total_opening_cost + total_supply_cost)
• Solve the model using local CPLEX:
    mdl.solve()
• Or solve it on the cloud:
    mdl.solve(url=SVC_URL, key=SVC_KEY)
DOcplex and Notebooks for Optimization
[Notebook screenshots not reproduced; surviving captions:]
• Installing DOcplex and configuring your credentials
• Easy to download and parse JSON
• Visualizing the input data
Slicing and Aggregate constructs
• Two important constructs to describe complex problems in a compact form:
  • Slicing filters: select a subset of items in a multi-dimensional collection
  • Aggregate:
    • used in combination with slicing,
    • builds the actual mathematical expression
OPL:
    forall ( l in leg_ids, we in weeks )
      leg_teu[l][we] == sum (tv in trans_vars : tv.l.leg_id == l && w[tv.date] == we)
                          trans[tv] * size[tv.eqc];
DOcplex:
    for l in leg_ids:
        for we in weeks:
            mdl.add_constraint(
                leg_teu[(l, we)] == mdl.sum(trans[(tv.leg_id, tv.mot, tv.date, tv.eqc)] * size[tv.eqc]
                                            for tv in trans_vars_list
                                            if tv.leg_id == l and w[tv.date] == we))
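The same slicing and aggregate pattern can be tried without a solver by using plain numbers in place of decision variables. In this sketch the Trans record type, the qty field and all data values are hypothetical; only the shape of the generator expression mirrors the DOcplex code above.

```python
from collections import namedtuple

# Hypothetical records shaped like the talk's transport tuples.
Trans = namedtuple("Trans", ["leg_id", "mot", "date", "eqc", "qty"])
trans_list = [
    Trans("CDC-BOR", "Truck", "10/06/16", "DRY-20", 3),
    Trans("CDC-BOR", "Train", "11/06/16", "HIGH-40", 1),
    Trans("CHE-MAR", "Train", "10/06/16", "HIGH-40", 2),
]
size = {"DRY-20": 1, "HIGH-40": 2}     # TEU per equipment class (assumed)
w = {"10/06/16": 23, "11/06/16": 23}   # date -> week number (assumed)


def leg_week_teu(leg, week):
    # Slicing: the `if` clause keeps only the records of this (leg, week).
    # Aggregate: sum() builds the total from the selected records.
    return sum(t.qty * size[t.eqc] for t in trans_list
               if t.leg_id == leg and w[t.date] == week)


print(leg_week_teu("CDC-BOR", 23))
```

With DOcplex, t.qty would be a decision variable and mdl.sum would return a linear expression rather than a number.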
Performance considerations
• Runtime model generation should be as efficient as possible:
  • it may be invoked thousands of times when running in production
  • large models may involve millions of variables and constraints
• A “naïve” translation of slicing/aggregate constructs in Python can be very inefficient when nested loops are involved
• Use pandas for handling slicing on large collections
“pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”
• DOcplex can benefit from the following pandas features:
  • Data organized in multi-indexed tables
  • Efficient merge operations between tables
  • Efficient indexing, filtering and grouping operations on tables
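The three pandas features listed above can be seen together on a tiny table. This is a minimal sketch with invented rows and column names (qty, teu, load are assumptions, not columns from the talk); it uses plain numbers so it runs without any optimization engine.

```python
import pandas as pd

# Hypothetical rows shaped like the talk's transport table.
trans_df = pd.DataFrame({
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR"],
    "eqc": ["DRY-20", "HIGH-40", "HIGH-40"],
    "week": [23, 23, 24],
    "qty": [3, 1, 2],
})
size_df = pd.DataFrame({"eqc": ["DRY-20", "HIGH-40"], "teu": [1, 2]})

# Merge: attach the TEU size of each equipment class, as a relational join.
merged = trans_df.merge(size_df, on="eqc", how="left")
merged["load"] = merged["qty"] * merged["teu"]

# Filtering: a boolean mask selects one slice without a Python-level loop.
week_23 = merged[merged["week"] == 23]

# Grouping: the result is a Series indexed by a (leg_id, week) MultiIndex.
load_by_leg_week = merged.groupby(["leg_id", "week"])["load"].sum()

print(load_by_leg_week)
```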
Performance considerations
• Data Frame trans_df:
    trans       eqc      leg_id   mot    date      week
    @trans_01   DRY-20   CDC-BOR  Truck  10/06/16  23
    @trans_02   HIGH-40  CHE-MAR  Train  10/06/16  23
    …           …        …        …      …         …
• “Naïve” slicing:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        for l in leg_ids:
            for we in weeks:
                mdl.add_constraint(
                    leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc] for t in trans_df_list
                                                if t.leg_id == l and w[t.date] == we))
    --> Elapsed time: 5875 ms
• Slicing with pandas:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        trans_df['week'] = trans_df.apply(lambda row: w[row.date], axis=1)
        for l in leg_ids:
            for we in weeks:
                slice_df = trans_df.loc[(trans_df.leg_id == l) & (trans_df.week == we)]
                mdl.add_constraint(
                    leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc]
                                                for t in slice_df.itertuples()))
    --> Elapsed time: 4681 ms
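SimpleTimer is used in these timing snippets but is never defined in the deck. The following is a hypothetical stand-in, assuming it is a context manager that reports the wall-clock time of its block; the meaning given to print_details and the elapsed_ms attribute are guesses made for this sketch.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in: times the body of a `with` block."""

    def __init__(self, label, print_details=True):
        self.label = label
        self.print_details = print_details
        self.elapsed_ms = None

    def __enter__(self):
        if self.print_details:
            print("Starting " + self.label)
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed_ms = (time.perf_counter() - self._start) * 1000.0
        print("--> Elapsed time: %.0f ms" % self.elapsed_ms)
        return False  # never swallow exceptions raised in the block


with SimpleTimer("TEU EQUATIONS-3", print_details=False) as timer:
    total = sum(i * i for i in range(100000))
```

Timings measured this way depend on the machine and the data, so they are only meaningful when the variants are compared on the same input.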
Performance considerations
• Issue with this formulation:
  • Slicing is calculated inside the nested loops
  • The cost of creating a pandas DataFrame is incurred at each iteration
• Much better strategy:
  • Prepare the results of all slicing filters before entering the nested loops
  • This can be done thanks to pandas’ groupby and aggregate operations
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        trans_df['week'] = trans_df.apply(lambda row: w[row.date], axis=1)
        for l in leg_ids:
            for we in weeks:
                slice_df = trans_df.loc[(trans_df.leg_id == l) & (trans_df.week == we)]
                mdl.add_constraint(
                    leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc]
                                                for t in slice_df.itertuples()))
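The strategy of preparing all slices before the loops does not depend on pandas. The sketch below shows the idea with a standard-library dictionary and invented (leg_id, week, teu) records: the first version rescans every record for each pair, the second buckets the records once and then only performs lookups.

```python
from collections import defaultdict

# Hypothetical (leg_id, week, teu) records; values invented for illustration.
records = [
    ("CDC-BOR", 23, 3), ("CDC-BOR", 23, 2), ("CHE-MAR", 23, 4),
    ("CHE-MAR", 24, 1), ("CDC-BOR", 24, 5),
]
leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24, 25]

# Slicing inside the loops: every (leg, week) pair rescans all records.
slow = {(l, we): sum(teu for (leg, week, teu) in records
                     if leg == l and week == we)
        for l in leg_ids for we in weeks}

# Slicing prepared beforehand: one pass buckets the records by key...
buckets = defaultdict(list)
for leg, week, teu in records:
    buckets[(leg, week)].append(teu)

# ...so the nested loops only do a dictionary lookup per pair.
fast = {(l, we): sum(buckets.get((l, we), []))
        for l in leg_ids for we in weeks}

print(fast == slow)
```

The first version does work proportional to the number of pairs times the number of records, while the second is proportional to their sum.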
Performance considerations
• Prepare all results of slicing beforehand:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        trans_df['week'] = trans_df.apply(lambda row: w[row.date], axis=1)
        trans_df['result'] = trans_df.apply(lambda row: row.trans * size[row.eqc], axis=1)
        legWeeksMultiIndex = pd.MultiIndex.from_product([leg_ids, weeks], names=["leg_id", "week"])
        legWeeksMultiIndex_df = pd.DataFrame(legWeeksMultiIndex.values.tolist(),
                                             columns=["leg_id", "week"])
        trans_full_df = legWeeksMultiIndex_df.merge(trans_df, how='left').fillna(0)
        trans_sum_grpby = (trans_full_df[['leg_id', 'week', 'result']]
                           .groupby(['leg_id', 'week'])
                           .aggregate(lambda x: mdl.sum(x.tolist())))
        for l in leg_ids:
            for we in weeks:
                mdl.add_constraint(leg_teu[(l, we)] == trans_sum_grpby.result[l, we])
    --> Elapsed time: 2323 ms
• Based on two pandas operations:
  • groupby: split the dataset into groups
  • aggregate: perform a computation on the grouped data
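The cross-product, left-merge and groupby steps can be followed on a small numeric table. In this sketch the data is invented and the result column holds plain floats, so a regular sum stands in for mdl.sum; with DOcplex the column would hold linear expressions instead.

```python
import pandas as pd

leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]
# Hypothetical data: 'result' already holds quantity * size for each row.
trans_df = pd.DataFrame({
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR"],
    "week": [23, 23, 24],
    "result": [3.0, 2.0, 4.0],
})

# Every (leg_id, week) pair, including pairs with no transport row.
pairs_df = pd.MultiIndex.from_product(
    [leg_ids, weeks], names=["leg_id", "week"]).to_frame(index=False)

# Left merge keeps the empty pairs; their missing 'result' becomes 0.
full_df = pairs_df.merge(trans_df, how="left", on=["leg_id", "week"]).fillna(0)

# One groupby pass computes the sum for every pair at once.
sums = full_df.groupby(["leg_id", "week"])["result"].sum()

for l in leg_ids:
    for we in weeks:
        print(l, we, sums.loc[(l, we)])
```

The cross product matters because a constraint is needed for every (leg, week) pair, even those with no transport row.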
Performance considerations
• Re-writing using helper methods for generic patterns:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        trans_df['week'] = trans_df.apply(lambda row: w[row.date], axis=1)
        trans_df['result'] = trans_df.apply(lambda row: row.trans * size[row.eqc], axis=1)
        trans_sum_grpby = for_cross_prod_sum_by([leg_ids, weeks], trans_df,
                                                ['leg_id', 'week'], 'result')
        for l in leg_ids:
            for we in weeks:
                mdl.add_constraint(leg_teu[(l, we)] == trans_sum_grpby.result[l, we])
• To be compared with the initial “naïve” slicing formulation:
    with SimpleTimer("TEU EQUATIONS-3", print_details=False):
        for l in leg_ids:
            for we in weeks:
                mdl.add_constraint(
                    leg_teu[(l, we)] == mdl.sum(t.trans * size[t.eqc] for t in trans_df_list
                                                if t.leg_id == l and w[t.date] == we))
• Performance vs. readability trade-off
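The body of for_cross_prod_sum_by is not shown in the deck. The sketch below is one hypothetical way such a helper could be written, keeping the call signature used above; the agg parameter is an addition made here so that the example runs on plain numbers, and the data is invented.

```python
import pandas as pd


def for_cross_prod_sum_by(key_lists, df, key_columns, value_column, agg=sum):
    """Hypothetical helper: aggregate `value_column` for every combination
    of the keys in `key_lists`, using an empty aggregate for absent ones."""
    grouped = {key: agg(group[value_column].tolist())
               for key, group in df.groupby(key_columns)}
    index = pd.MultiIndex.from_product(key_lists, names=key_columns)
    values = [grouped.get(key, agg([])) for key in index]
    return pd.DataFrame({value_column: values}, index=index)


leg_ids = ["CDC-BOR", "CHE-MAR"]
weeks = [23, 24]
trans_df = pd.DataFrame({
    "leg_id": ["CDC-BOR", "CDC-BOR", "CHE-MAR"],
    "week": [23, 23, 24],
    "result": [3.0, 2.0, 4.0],
})

trans_sum_grpby = for_cross_prod_sum_by([leg_ids, weeks], trans_df,
                                        ["leg_id", "week"], "result")
for l in leg_ids:
    for we in weeks:
        print(l, we, trans_sum_grpby.result[l, we])
```

Under the assumption that the model's own sum function accepts a list of expressions, passing it as agg would presumably yield one linear expression per (leg, week) pair.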
Conclusion
• Python is one of the most relevant tools for quickly turning an idea into working code when dealing with data-wrangling problems, and then visualizing the results.
• The exact same code that has been written and tested in a notebook for loading data, modelling an optimization problem, solving it… can readily be integrated and executed in a deployed Python environment.
• DOcplex objective: facilitate the diffusion and use of optimization technologies
• DOcplex + pandas: an alternative to specialized modelling languages
• Ongoing effort to define “best practices” and patterns to:
  • address performance issues
  • facilitate the formulation of models that are readable and maintainable
Thank you!
Questions/Answers
Legal Disclaimer
• © IBM Corporation 2016. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
