Data science on big data. Pragmatic approach

1.
Data science using Big Data. Pragmatic approach. Pavel Mezentsev, Dmitri Babaev. Tinkoff Bank

2.
Machine learning in business

3.
Other examples of ML tasks

4.
Size of data • Small Data: ~1 million objects • Big Data: ~1 billion objects • Huge Data: ~1 trillion objects

5.
Predictive models design

6.
Small data case

10.
Big data case

11.
Big data case

12.
Apache Spark case

13.
How does it work

14.
How does it work

15.
How does it work

16.
How does it work

17.
How does it work

18.
How does it work

19.
How does it work

20.
Spark SQL

21.
Spark SQL

22.
Spark SQL

23.
Spark SQL

24.
Spark SQL

25.
Spark SQL

26.
Spark SQL

27.
Spark SQL

28.
Spark SQL

29.
Apache Spark case

30.
Apache Spark in Large Scale Machine Learning

31.
LSML issues • Too many samples to classify • Training data does not fit in memory • Too many training samples • Too many models to train

32.
Why not MLlib? • MLlib is less stable • it offers too few algorithms compared to scikit-learn • its ML pipelines are not as mature as those in scikit-learn • e.g. there is no simple way to use logistic regression for feature selection • the MLlib Python API lags behind the Java/Scala API • MLlib is actively developed and may become a feasible choice in the near future

33.
Spark + scikit-learn = ? • Parallel training • meta-parameter grid search • parallel one-vs-rest for multi-class models • same features but different targets • parallel bagging and ensembles • parallel learning for multi-step classification • Parallel prediction
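The grid search item above can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic dataset, the grid values, and the helper name `fit_and_score` are assumptions. Each parameter combination is an independent task, so the built-in `map` is used here as a local stand-in for the cluster.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption) standing in for a training set small enough
# to be shipped to every worker.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid: every value of C is one independent training task.
grid = [{"C": c} for c in (0.01, 0.1, 1.0, 10.0)]


def fit_and_score(params):
    """Train one model for one parameter combination; return its mean CV accuracy."""
    model = LogisticRegression(max_iter=1000, **params)
    score = cross_val_score(model, X, y, cv=3).mean()
    return params, score


# Local stand-in for the cluster. With a SparkContext `sc`, the same function
# could be run as: sc.parallelize(grid).map(fit_and_score).collect()
results = list(map(fit_and_score, grid))
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, round(best_score, 3))
```

The same shape fits the other items on the slide: one task per class for one-vs-rest, one task per target column, or one task per bootstrap sample for bagging.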

34.
Distributed learning

43.
Distributed prediction
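One way to picture distributed prediction is partition-wise scoring: each worker gathers the rows of its partition into a matrix and calls `predict` once. The sketch below is an assumption about the pattern, not the authors' implementation; the model, the data, and the helper `predict_partition` are hypothetical, and a list of array chunks stands in for Spark partitions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small trained model (both assumptions for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)


def predict_partition(rows):
    """Score one partition: collect its rows into a matrix and predict once."""
    batch = np.array(list(rows))
    if batch.size == 0:
        return iter([])
    return iter(model.predict(batch).tolist())


# Local stand-in for Spark partitions: split the rows into 4 chunks.
# With an RDD of feature rows, the same function could be passed to
# rdd.mapPartitions(predict_partition).
partitions = np.array_split(X, 4)
predictions = [p for part in partitions for p in predict_partition(iter(part))]
print(len(predictions))
```

Predicting once per partition rather than once per row keeps the per-call overhead of scikit-learn small.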

50.
Models management • A complex ML task can be expressed as a scikit-learn pipeline • e.g. feature scaling, then LR feature selection, then GBT classification • Trained models can be stored on HDFS / S3 along with a training report and loaded for prediction
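The pipeline described above might look like the sketch below. The step names, hyperparameters, and synthetic data are assumptions. Serialization is shown with `pickle` to an in-memory byte string; writing those bytes to HDFS or S3 would need a storage client that is not shown here.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) with a few informative features.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

pipeline = Pipeline([
    # Step 1: feature scaling.
    ("scale", StandardScaler()),
    # Step 2: keep features whose LR coefficient magnitude is at least the mean
    # (the default SelectFromModel threshold for an L2-penalized model).
    ("select", SelectFromModel(LogisticRegression(C=0.1, max_iter=1000))),
    # Step 3: gradient boosted trees on the selected features.
    ("gbt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X, y)

# The whole fitted pipeline serializes to one byte string that can be stored
# and later loaded for prediction.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)
print(restored.predict(X[:5]))
```

Storing the pipeline as a single object means the scaler and the selected feature set travel with the classifier, so prediction code only needs to call `predict`.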

51.
Alternative approaches to LSML • partial / online learning • e.g. stochastic gradient descent • distributed stochastic gradient descent • is implemented in MLlib
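Partial (online) learning with stochastic gradient descent can be sketched with scikit-learn's `SGDClassifier.partial_fit`, which updates the model one batch at a time so the full training set never has to sit in memory. The batch size and synthetic data below are assumptions; in practice each batch would be read from disk or a stream.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data (assumption) standing in for a dataset read in chunks.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# partial_fit needs the full set of class labels up front, because a single
# batch may not contain every class.
classes = np.unique(y)

model = SGDClassifier(random_state=0)
batch_size = 100  # hypothetical batch size
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # One pass of SGD updates over this batch only.
    model.partial_fit(X_batch, y_batch, classes=classes)

accuracy = model.score(X, y)
print(round(accuracy, 3))
```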

52.
Spark is everywhere • Mahout on Spark, aka Samsara • H2O Sparkling Water

53.
Questions? Pavel Mezentsev pavel@mezentsev.org Dmitri Babaev dmitri.lb@gmail.com
