6. Data – THE NEW POWER
Individual Transaction-Level Data
Industry Data
Internal Data
Data & Insights Platform
Delivering business value for our clients
Data for Products and Operational Process: allows us to deliver a better service for our customers
Data for Dashboarding and Business Decisions: allows us to optimise the business and give a better price to our customers
Data for Predictive Analytics: allows us to give more knowledge to our customers
7. Industry Data
Individual Transaction-Level Data
Internal Data
Better Agility
Data Lake and Data Warehousing in the same platform
Enable Data Discovery
Collect more data
Analyse the data with high performance
Next Gen of Data Visualisation on top of Hadoop
14. Feature engineering
The creative part, with a lot of trial and error.
Andrew Fogg won the competition by categorising the colours of cars.
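As a rough illustration of what categorising a raw colour field can look like, here is a minimal sketch in plain Python. The colour groups, the sample cars and the helper names (`colour_group`, `one_hot`) are all hypothetical, made up for this example; they are not the features used in the competition.

```python
# Hypothetical mapping from raw colour names to broader groups (an assumption
# for illustration, not the grouping used in the competition).
COLOUR_GROUPS = {
    "black": "neutral", "white": "neutral", "grey": "neutral", "silver": "neutral",
    "red": "bright", "yellow": "bright", "orange": "bright",
    "blue": "cool", "green": "cool",
}
CATEGORIES = ["neutral", "bright", "cool", "other"]


def colour_group(raw_colour):
    """Normalise a raw colour string and map it to a group ("other" if unknown)."""
    return COLOUR_GROUPS.get(raw_colour.strip().lower(), "other")


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1 if value == category else 0 for category in categories]


# Made-up sample records with messy colour values.
cars = [
    {"id": 1, "colour": "Red "},
    {"id": 2, "colour": "SILVER"},
    {"id": 3, "colour": "teal"},
]

for car in cars:
    car["colour_group"] = colour_group(car["colour"])
    car["colour_features"] = one_hot(car["colour_group"], CATEGORIES)
```

The trial-and-error part is choosing the groups: each candidate grouping becomes a new feature that has to be tested against the model's validation score.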
15. Models to predict
● ML is often used in data science (DS).
● The current trend in ML is xgboost, which most of the time gives better results than the traditional Random Forest and Neural Networks.
● Reasons for its success? More accurate, more efficient, easy to use, customisable and distributed.
● It needs less time spent on feature engineering, but still needs some creativity.
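xgboost is a gradient-boosted tree library. To give a feel for the underlying idea, the sketch below implements plain gradient boosting for squared error with one-split trees (stumps) on a single feature, in pure Python. It is not xgboost and does not use its API; the function names, the toy data and the default `n_rounds` and `learning_rate` values are assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical: predict the mean residual everywhere.
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return sum((predict_boosting(model, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(ys)


# Made-up toy data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

one_round = fit_boosting(xs, ys, n_rounds=1)
many_rounds = fit_boosting(xs, ys, n_rounds=50)
```

Each round corrects part of what the previous rounds got wrong, so the training error of `many_rounds` is lower than that of `one_round`. Libraries such as xgboost build on this idea with deeper trees, regularisation and parallel or distributed training.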
18. Evaluation - validations
● Overfitting or underfitting is the biggest fear of a Data Scientist.
● Cross-validation is one way to guard the model against overfitting.
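A minimal sketch of k-fold cross-validation in plain Python follows. The data is split into k folds, and each fold is held out once while the model is fitted on the rest, so every score comes from data the model has not seen. The function names, the toy data and the trivial "predict the mean" model are assumptions for illustration only.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=4, seed=0):
    """Return one mean squared error per held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores


# Hypothetical baseline model: always predicts the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Made-up toy data.
xs = list(range(12))
ys = [2.0 * x + 1.0 for x in xs]

scores = cross_validate(xs, ys, fit_mean, predict_mean, k=4)
average_score = sum(scores) / len(scores)
```

A model that scores much better on its training data than on the held-out folds is likely overfitting; one that scores badly on both is likely underfitting.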
19. Feedback loop
● An ML algorithm is a living system: like any living specimen, it needs care!
● Learning from its mistakes is the only way for it to progress and to fit a real AI model.
20. Bad Methodology
Main reasons:
• No clear business case
• Trying to build the most accurate model from the start
• No agility
• No code version control
21. An iterative delivery is key
Sprint 1
Sprint 2
Main takeaways:
• Agility is required
• Weekly delivery is highly recommended to avoid falling into the “tunnel effect”
23. Gartner Says
“More Than 40 Percent of Data Science Tasks Will Be Automated by 2020”
Source: https://www.gartner.com/newsroom/id/3570917
Automation in Machine Learning is starting
24. Gain in Efficiency
● In the old days of the BI world, we gained efficiency by using ETL tools rather than hand-written scripts.
However, ML is often associated with R/Python/Scala coding.
25. Dataiku Flow => enables AML (automated machine learning)
My favorite app
The Collaborative Data Science Platform: Dataiku