This document contains a summary of a presentation about data mining and machine learning tools in Python. It discusses popular Python packages like NumPy, SciPy, Pandas, Scikit-learn, Spark, and NetworkX. It also covers data sources, preprocessing, common algorithms like classification, regression and clustering. Finally, it provides examples of network graphs and discusses scaling tools from small to big data using databases, HDFS and analysis frameworks like Hive and Spark.
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
1. Speaker : Alexey Zinoviev
2. About
● I am a <graph theory, machine learning, traffic jam prediction,
Big Data algorithms> scientist
● But I'm a <Java, Scala, Python, NoSQL, Hadoop, Spark>
programmer
11. ● Which loan applicants are high-risk?
● How do we detect phone card fraud?
● Which customers prefer product A over product B?
● What is the revenue prediction for next year?
Typical questions for DM
21. ● Relational Databases
● Data warehouses (Historical data)
● Files in CSV or in binary format
● Internet or email
● Scientific, research (R, Octave,
Matlab)
Data sources
24. ● All your personal data (PD) are being
deeply mined
● The industry of collecting, aggregating,
and brokering PD is “database marketing.”
● 1.1 billion browser cookies, 200 million
mobile profiles, and an average of 1,500
pieces of data per consumer in Acxiom
Pay with your personal data
29. It is the process of finding a model or function that describes and
distinguishes data classes, in order to predict the class of objects whose
class label is unknown.
What is Cluster Analysis?
30. ● Statistical process for estimating the
relationships among variables
● The estimation target is a function (it can
be a probability distribution)
● Can be linear, polynomial, nonlinear,
etc.
Regression
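A minimal sketch of the linear and polynomial cases from the slide above, using NumPy least squares. The data is synthetic and the names (`slope`, `intercept`, `predict`) are chosen only for this illustration.

```python
import numpy as np

# Synthetic noisy samples of y = 2x + 1 (hypothetical data, illustration only).
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Linear regression: estimate slope and intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Polynomial regression: the same call with a higher degree
# returns three coefficients (highest power first).
quad_coeffs = np.polyfit(x, y, deg=2)


def predict(x_new):
    """Evaluate the fitted linear model at x_new."""
    return slope * x_new + intercept
```

The estimated `slope` and `intercept` should land close to the values used to generate the data, which is an easy sanity check before trying the same call on real measurements.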
31. ● Training set of classified examples
(supervised learning)
● Test set of non-classified items
● Main goal: find a function (classifier) that
maps input data to a category
● Computer vision, drug discovery, speech
recognition, biometric identification,
credit scoring
Classification
34. ● There are two classes of objects A & B
● Define the class of new object, based on
information about its neighbors
● By changing the boundary of the area around a new object,
we form its set of neighbors.
● The new object is classified as B because the majority of
its neighbors are B.
kNN (k-nearest neighbor)
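The majority vote described above can be sketched in plain Python. The function name `knn_classify` and the sample points are made up for this illustration; Euclidean distance is an assumption, since the slide does not name a metric.

```python
import math
from collections import Counter


def knn_classify(train, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points.

    train is a list of (point, label) pairs, where point is a tuple of floats.
    """
    # Sort training examples by Euclidean distance to the new object.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    # Count the labels of the k closest neighbors and pick the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two hypothetical classes of objects, A and B.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.0), "B"),
]

label = knn_classify(train, (6.0, 6.5), k=3)
```

Changing `k` changes the size of the neighborhood, which is the "boundary of the area" the slide refers to. An odd `k` avoids ties when there are two classes.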
40. ● It’s free
● The stack of ML algorithms is not
fully implemented
● All your matrix are belong to us!
● Single-threaded model
● Hard to integrate with anything else
Why not Octave?
42. ● Ancient community
● Syntax is too sweet
● You should read 1000 lines in docs to
write 1 line of code
● Single-threaded model for 95% of
algorithms
● Not good for production
Why not R?
43. ● Now Python is an idol for young scientists
due to the low barrier to entry
● Python is a glue
● It solves the two-language problem
● Have you ever heard of Jython?
● Not such a long way to real high-load
production
Why not Python?
45. ● A fast and efficient multidimensional
array object ndarray
● Tools for reading and writing array-based
data sets to disk
● Linear algebra operations, Fourier
transform
● Tools for integrating C, C++, and Fortran code
Son of the Fortran : NumPy
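A short sketch touching three of the points above: the `ndarray` object, saving and loading arrays on disk, and the linear algebra and Fourier routines. The array values and the temporary file name are arbitrary choices for the example.

```python
import os
import tempfile

import numpy as np

# ndarray: a typed, multidimensional array with vectorized operations.
a = np.arange(6, dtype=np.float64).reshape(2, 3)
col_sums = a.sum(axis=0)  # sum over rows, one value per column

# Reading and writing array data: round-trip through a .npy file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.npy")
    np.save(path, a)
    restored = np.load(path)

# Linear algebra: solve the system m @ x = b.
m = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(m, b)

# Fourier transform: a sine with 5 cycles over 64 samples
# has its spectral peak in frequency bin 5.
signal = np.sin(2 * np.pi * 5 * np.arange(64) / 64)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))
```

The C, C++, and Fortran integration point is not shown here, since it needs a compiler toolchain rather than a few lines of Python.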
46. ● Sparse matrices, signal processing tools,
optimization, differential equations
● Integration of old Fortran libraries
● Tool for using inline C++ code to
accelerate array computations
● Powerful statistics tools and discrete
probability distributions
SciPy as a servant
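A hedged sketch of four of the areas listed above (sparse matrices, optimization, differential equations, discrete distributions), using `scipy.sparse`, `scipy.optimize`, `scipy.integrate`, and `scipy.stats`. The toy problems are invented for the example, and the inline C++ tool from the slide is not shown.

```python
import numpy as np
from scipy import integrate, optimize, sparse, stats

# Sparse matrix: only the nonzero entries are stored.
dense = np.array([[0, 0, 3], [4, 0, 0], [0, 0, 0]])
sp = sparse.csr_matrix(dense)
nnz = sp.nnz  # number of stored nonzero values

# Optimization: minimize the scalar function (x - 2)^2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# Differential equation: dy/dt = -y with y(0) = 1, integrated to t = 1.
solution = integrate.solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0])
y_end = solution.y[0, -1]  # should be close to exp(-1)

# Discrete probability distribution: P(X = 3) for X ~ Binomial(10, 0.5).
p_three = stats.binom.pmf(3, n=10, p=0.5)
```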
48. ● The DataFrame, a two-dimensional tabular, column-oriented
data structure with both row and column labels
● It is easy to reshape, slice and dice, perform aggregations, and
select subsets (EASY JOIN AND GROUP BY IN CODE!!!!)
● High-performance time series functionality and tools well-
suited for working with financial data
● Read and write CSV, JSON, HTML, XML, HDF5 and special
binary formats
Pandas : what’s new?
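The "easy join and group by in code" point can be illustrated with a small sketch. The two tables, their column names, and the city values are hypothetical; only `merge`, `groupby`, `to_csv`, and `read_csv` come from pandas itself.

```python
import io

import pandas as pd

# Two small hypothetical tables: orders and the customers who placed them.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [10.0, 25.0, 5.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Omsk", "Omsk", "Minsk"],
})

# JOIN: like SQL's INNER JOIN ... ON customer_id.
joined = orders.merge(customers, on="customer_id", how="inner")

# GROUP BY: total order amount per city.
revenue_by_city = joined.groupby("city")["amount"].sum()

# Read and write CSV: round-trip through an in-memory text buffer.
csv_text = joined.to_csv(index=False)
roundtrip = pd.read_csv(io.StringIO(csv_text))
```

The same pattern works with files on disk by passing a path instead of the `StringIO` buffer.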
51. ● Classification, regression, clustering,
decision trees
● Comparing, validating and choosing
parameters and models
● Feature extraction and normalization
● Encoding categorical features
● Dimensionality Reduction
Scikit-learn
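A minimal sketch combining three of the items above: feature normalization, classification, and choosing parameters by cross-validation. The iris dataset, the k-nearest-neighbors classifier, and the parameter grid are arbitrary choices for the example, not taken from the talk. Encoding categorical features is not shown.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small built-in dataset and hold out 30% of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Normalization and classification chained into one estimator.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Choose the number of neighbors by 5-fold cross-validation on the training set.
param_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Accuracy of the best model on data it has not seen.
accuracy = search.score(X_test, y_test)
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
```

Keeping the scaler inside the pipeline means it is refit on each cross-validation fold, so no information from the validation part leaks into the normalization step.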
57. ● MapReduce in memory
● Up to 50x faster than Hadoop
● Support for Shark (like Hive), MLlib
(Machine learning), GraphX (graph
processing)
● RDD is the basic building block (an immutable
distributed collection of objects)
Spark
59. ● Python is dynamically typed, so RDDs can hold objects of
multiple types
● PySpark does not yet support a few API calls, such as lookup
and non-text input files
● PySpark requires Python 2.6 or higher (not tested with Python 3)
● PySpark applications are executed using a standard CPython
interpreter
● MLlib is supported, GraphX is not supported
PySpark (Python API for Spark)
63. ● Python language data structures for
graphs, digraphs, and multigraphs
● Generators for classic graphs, random
graphs, and synthetic networks
● Analysis: clustering, independent set,
closeness, betweenness, connectivity,
transitivity, flows, isomorphism,
PageRank
NetworkX
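A small sketch of the three bullet points above: a graph data structure, a generator for a classic graph, and a few of the listed analysis functions. The four-node graph is made up for the example.

```python
import networkx as nx

# Data structure: an undirected graph with a triangle (a, b, c) and a tail (c, d).
g = nx.Graph()
g.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])

# Generator for a classic graph: a ring of 5 nodes.
ring = nx.cycle_graph(5)

# Analysis functions named on the slide.
connected = nx.is_connected(g)
transitivity = nx.transitivity(g)            # 3 * triangles / connected triples
betweenness = nx.betweenness_centrality(g)   # dict: node -> score
closeness = nx.closeness_centrality(g)       # dict: node -> score

# Node c lies on every shortest path to d, so it has the highest betweenness.
most_central = max(betweenness, key=betweenness.get)
```

For directed graphs and multigraphs, `nx.DiGraph` and `nx.MultiGraph` offer the same edge-adding interface.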
64. Graph | Number of vertices | Number of edges | Volume | Data per day
Web graph | 1.5 * 10^12 | 1.2 * 10^13 | 100 PB | 300 TB
Facebook (friends graph) | 1.1 * 10^9 | 160 * 10^9 | 1 PB | 15 TB
Road graph of EU | 18 * 10^6 | 42 * 10^6 | 20 GB | 50 MB
Road graph of this city | 250 000 | 460 000 | 500 MB | 100 KB
65. ● Leaderboard
● How accurate does my model really need to
be? What data can I use? (business
problem)
● Good clear data for unbelievable situations
● Fun & enjoyment of building a model
● Kaggle is too far from your production
Machine learning isn't Kaggle competitions
66. Size | Classification | Tools
Lines | Sample data: analysis and visualization | Whiteboard, bash
KBs - low MBs | Prototype data: analysis and visualization | Matlab, Octave, R
MBs - low GBs | Online data: storage | MySQL (DBs)
MBs - low GBs | Online data: analysis | NumPy, SciPy, Weka, BLAS/LAPACK
GBs - TBs - PBs | Big Data: storage | HDFS, HBase, Cassandra
GBs - TBs - PBs | Big Data: analysis | Hive, Mahout, Hama, Giraph, MLlib
67. Mailing lists, Google Groups: pydata, pystatsmodels,
numpy-discussion, scipy-user
Conferences: PyCon, EuroPython and its friends SciPy and
EuroSciPy
Books: Python for Data Analysis, Mastering Machine Learning with
scikit-learn
Community, conferences and books