Create a Data Science Lab with Microsoft and Open Source tools
1.
2. Create a Data Science Lab with
Microsoft and Open Source Tools
Marcel Franke, pmOne AG, Germany
3. About me – Marcel Franke
Practice Lead Advanced Analytics & Data Science
pmOne AG – Germany, Austria, Switzerland
>10 years of experience with large-scale
Data Warehouses based on SQL Server
Blog: dwjunkie.wordpress.com
5. The Definition
Data science incorporates varying
elements and builds on techniques and
theories from many fields, including
mathematics, statistics, data engineering,
pattern recognition and learning, advanced
computing, visualization, uncertainty
modeling, data warehousing, and high
performance computing with the goal of
extracting meaning from data and
creating data products.
Source: http://en.wikipedia.org/wiki/Data_science
8. The beginnings of gambling
Gambling has existed since 3000 BC
First games based on dice
Origins in China and Mesopotamia
* Source: Tiemeyer, E.; Zsifkovitis, H.: Information als Führungsmittel, München: Computerwoche Verlag 1995
9. Scientific foundations
17th century: paradox of the
Chevalier de Méré
Pascal and Fermat discussed
the paradox in several letters
The beginning of the theory of
probability
* Source: http://de.wikipedia.org/wiki/De-M%C3%A9r%C3%A9-Paradoxon
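The paradox can be checked with a few lines of arithmetic. The sketch below assumes the usual textbook statement of the two games (at least one six in 4 throws of one die, versus at least one double six in 24 throws of two dice); the slide itself does not spell these numbers out, and the function name is only illustrative.

```python
from fractions import Fraction


def p_at_least_one(p_single: Fraction, throws: int) -> Fraction:
    """Probability of at least one success in `throws` independent trials."""
    return 1 - (1 - p_single) ** throws


# Game 1 (assumed textbook version): at least one six in 4 throws of one die.
p_one_die = p_at_least_one(Fraction(1, 6), 4)

# Game 2 (assumed textbook version): at least one double six in 24 throws of two dice.
p_two_dice = p_at_least_one(Fraction(1, 36), 24)

print(f"one die, 4 throws:   {float(p_one_die):.4f}")
print(f"two dice, 24 throws: {float(p_two_dice):.4f}")
```

The first game is slightly favourable (just above 50 %), the second slightly unfavourable (just below 50 %), even though 4/6 and 24/36 look like the same odds at first glance.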
10. The science in Data Science
Calculate probabilities
Pattern recognition
Calculation of analytical variance
Machine Learning
Simulations
Predictions
19. Other areas of application
SOCIAL MEDIA
PRODUCT RECOMMENDATION
RETARGETING
PREDICTIVE MAINTENANCE
PREDICT RISKS
SALES PREDICTIONS
CUSTOMER ANALYSIS
DYNAMIC PRICING
DISPOSITION
21. Our starting point…
Structured data
Unstructured data
Harmonize and
generate Information
(Role of „Data Scientist“)
„BIG Data“
Volume, Variety, Velocity
22. Typical Big Data Architecture
Big Data Analytics
Excel
Big Data Advanced Analytics
PowerPivot
Big Data Preparation (SQL, Map Reduce)
Unstructured data
Structured data
Massive Parallel Processing
Big Data Storage Platform
23. “[Facebook] started in the Hadoop world. We are now bringing in
relational to enhance that. We're kind of going [in] the other
direction.”
“We've been there, and [we] realized that using the wrong
technology for certain kinds of problems can be difficult. We
started at the end and we're working our way backwards, bringing
in both.”
Ken Rudin, Director of Analytics for Facebook
Source: http://tdwi.org/articles/2013/05/06/facebooks-relationalplatform.aspx?j=192038&e=marcel.franke@pmone.com&l=50_HTML&u=3967541&mid=1060748&jb=84&m=1
24. A few words about „R“
• R is a language and environment for statistical
computing and graphics
• R is open source under the GNU General Public License
• One of the most widely used statistical software packages
• Everything happens in-memory
• Comes with a package manager (~5000 packages)
• Also provides graphical functionality
27. Starting Point
Problems that we already know from the BI world are further exacerbated by
big data.
• Complexity of systems grows constantly
• Amount of data grows exponentially (= Big Data)
• Need for change is more frequent and is increasingly delving deeper into
business rules
• Solutions can no longer be planned in advance
28. Solution Option 1 – Classic Deterministic
Everything can be planned and
designed at the drawing board…
29. How do the products & components of a system and their
relationships behave with each other?
Source: Cesar Hidalgo
30. Solution Option 2 – Learn from „Mother Nature“
• How does nature deal with complex non-linear systems?
• Evolution – Variation and selection – „Trial and Error“
„It is not the strongest of the species that
survives, nor the most intelligent but the one
most responsive to change.“ (Charles Darwin)
35. An efficient laboratory to experiment
Power Pivot
In-Memory
Microsoft Excel
Power View
Unstructured
Data
Power Query
Source Systems
Power Map
SQL Server
Structured
Data
OLE DB
OData
WebServer-Logs
Sensor-Data
Data Marketplace
SAP
Databases
36.
37. Easy to consume
The factory
Integrated in the business process
Analyze on mass data
Host it and run it
At Enterprise Scale
For Realtime Enterprise
38. Stable Big Data Architecture
Prediction &
Data Science
Front-Ends &
Mobile
Windows
Azure
On-Premises
Source Systems
Unstructured
Data
WebServer-Logs
Sensor-Data
HDInsight
SQL Server PDW
Data Marketplace
Structured
Data
SAP
Databases
42. How do we scale?
Relational data & compute
SQL Server 2012
Parallel Data
Warehouse
Half Rack
Infiniband
Analytical data &
compute
HP DL 385
40 Cores
2 TB RAM
Fusion-IO Card
43. What is Revolution Analytics?
• Founded in 2007
• Aim: Evolving R for high performance
• Offers R packages for faster performance and
greater stability
• Enterprise & Community products
• Stand-alone, Scale-out (HPC), on Hadoop
44. How do we handle our data?
R-ODBC: 10 MB/s
Flat file export: 80 MB/s
Data preparation
Data transfer
predictive scripts
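To make the difference between the two transfer paths tangible, here is a small back-of-the-envelope sketch. The two rates (10 MB/s and 80 MB/s) come from the slide; the 100 GB data volume and the helper name `transfer_seconds` are purely hypothetical and only serve as an illustration.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time needed to move `size_mb` megabytes at a constant rate."""
    return size_mb / rate_mb_per_s


# Rates as stated on the slide.
RATES_MB_PER_S = {"R-ODBC": 10.0, "Flat file export": 80.0}

# Hypothetical data volume (100 GB), chosen only for illustration.
EXAMPLE_SIZE_MB = 100_000.0

for name, rate in RATES_MB_PER_S.items():
    minutes = transfer_seconds(EXAMPLE_SIZE_MB, rate) / 60
    print(f"{name:<17} {minutes:7.1f} minutes")

# Relative speed-up of the flat file path over ODBC.
speedup = RATES_MB_PER_S["Flat file export"] / RATES_MB_PER_S["R-ODBC"]
print(f"speed-up: {speedup:.0f}x")
```

Whatever the real volume is, the ratio stays the same: the flat file path is eight times faster than the ODBC path.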
45. Results
• Generate predictions for 30,000 customers
– 50,000 rows per customer, 54 columns
– Customer goal: 5 minutes
– Our solution: 7,500 customers in 5 minutes
– Benchmark: 1 minute
• Revolution Analytics ODBC driver does not work with PDW
• Standard R ODBC driver reads data at 10 MB/s
• Workaround via flat file export
• RDS format is faster than CSV
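The numbers on this slide can be put into relation with a short sizing sketch. It rests on two assumptions that are not stated on the slide: every value is stored as an 8-byte number, and the customer goal of 5 minutes applies to all customers at once. Treat the results as rough orders of magnitude only.

```python
# Figures taken from the slide.
CUSTOMERS = 30_000
ROWS_PER_CUSTOMER = 50_000
COLUMNS = 54
GOAL_MINUTES = 5
ACHIEVED_CUSTOMERS = 7_500   # customers scored in 5 minutes
ACHIEVED_MINUTES = 5

# Assumption (not from the slide): every value is an 8-byte number.
BYTES_PER_VALUE = 8

total_rows = CUSTOMERS * ROWS_PER_CUSTOMER
raw_gb = total_rows * COLUMNS * BYTES_PER_VALUE / 1e9

# Throughput actually reached versus the throughput the goal would need,
# assuming the goal covers all customers.
achieved_per_minute = ACHIEVED_CUSTOMERS / ACHIEVED_MINUTES
required_per_minute = CUSTOMERS / GOAL_MINUTES
minutes_for_all = CUSTOMERS / achieved_per_minute
gap_factor = required_per_minute / achieved_per_minute

print(f"total rows:            {total_rows:,}")
print(f"raw size (assumed):    {raw_gb:.0f} GB")
print(f"time for all customers: {minutes_for_all:.0f} minutes")
print(f"gap to goal:           {gap_factor:.0f}x")
```

Under these assumptions the full data set is in the range of several hundred gigabytes, and the reported throughput would need to improve by roughly a factor of four to score all customers within the goal.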
46. Other solutions?
• R in database
• R on Hadoop
– RHadoop
– Revolution Analytics RHadoop
48. THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm
Editor's Notes
A lot of topics and skills are combined. Data Warehouse is also a part of it. More statistics and mathematics skills are needed.
Where does Data Science come from?
When you do some research on that topic, you will automatically stumble upon gambling or games of chance.
Dice cup
2 scientists started thinking about gambling in a more scientific way. Writing very long letters back and forth. Different probability to win if you play with 1 die or 2.
1.) How big is the probability to win or lose, or to reach a certain goal? 2.) Is there any correlation between the customer income and the sales amount? 5.) What happens if we change certain parameters like price? 6.) What is the sales amount of a certain product in the next quarter or year?
How does this topic fit with BI?
What can I do with it?
So what do companies do with it? I consciously didn't use the word Big Data, but you all know that this new area is very hot in marketing and news. So what are the good examples & use cases?
Kasse – cash desk; Belohnung – reward; Windel – nappy
Emphasize the importance of R -> almost all vendors build on R. It is used a lot in the open source space.
Injector for washing pellets. Waste, poor quality,
Idea of a process model called Lab & Factory. Experimental approach. Iterative. Fast. Find new patterns.
It is for the data scientist to experiment.
If we have found something interesting, we can deploy it to the factory. It's the place where we run our analytical code at Enterprise scale.
Most of the analytical tools have been out there for years, like databases, R, SAS, SPSS. We often hear about limitations in scalability & performance. DB -> MPP. R, SAS -> In-Memory
POC on different analytic use cases with the big vendors. Complex SQL queries. Simulations. Predictions with R.
SQL -> we know how to scale. R -> scaling is difficult, hence Revolution.