Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...

Speaker : Alexey Zinoviev

About
● I am a <graph theory, machine learning, traffic jam prediction,
BigData algorithms> scientist
● But I'm a <Java, Scala, Python, NoSQL, Hadoop, Spark>
programmer

We need an R dev 'cause Data Mining

You're a programmer, not an analyst

Let’s talk about it, Python-boy...

Data mining
Mining coal
in your data

Hey, man, predict me something!

● Which loan applicants are high-risk?
● How do we detect phone card fraud?
● Which customers prefer product A over product B?
● What is the revenue prediction for next year?
Typical questions for DM

1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
Magic part of KDD (Knowledge
Discovery in Databases)

1. Share your data with us
2. Our magic manipulations
3. Building an answering machine
4. PROFIT!!!
How it really works

● Facebook users, tweets
● Weather
● Sea routes
● Trade transactions
● Government
● Medicine (genomic data)
● Telecommunications (phone call records)
Data examples

● Relational Databases
● Data warehouses (Historical data)
● Files in CSV or in binary format
● Internet or electronic mails
● Scientific, research (R, Octave,
Matlab)
Data sources

Target Data & Personal Data

● All your personal data (PD) are being
deeply mined
● The industry of collecting, aggregating,
and brokering PD is “database marketing.”
● 1.1 billion browser cookies, 200 million
mobile profiles, and an average of 1,500
pieces of data per consumer in Acxiom
Pay with your personal data

● Select small pieces
● Define default values for missing data
● Remove strange signals (outliers) from data
● Merge some tables into one if required
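
The four pre-processing steps above can be sketched with pandas. This is a minimal illustration, not a recipe from the talk: the tables, the column names, the median default and the "10 times the median" outlier rule are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy tables; the column names are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, None, 29, 41],
    "note": ["a", "b", "c", "d"],
})
payments = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "amount": [120.0, 95.0, 10_000.0, 110.0],
})

# 1. Select a small piece: keep only the columns we need.
selected = customers[["customer_id", "age"]].copy()

# 2. Define a default value for missing data (assumed here: the median age).
selected["age"] = selected["age"].fillna(selected["age"].median())

# 3. Remove strange signals: drop payments far above the median amount.
limit = 10 * payments["amount"].median()
clean_payments = payments[payments["amount"] <= limit]

# 4. Merge the tables into one.
prepared = selected.merge(clean_payments, on="customer_id", how="inner")
print(prepared)
```

Which default value and which outlier rule make sense depends on the data; the ones used here are only placeholders.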

It is the process of finding a model (or function) that describes and
distinguishes data classes, in order to predict the class of objects whose
class label is unknown.
What is Cluster Analysis?

● Statistical process for estimating the
relationships among variables
● The estimation target is a function (it can
be a probability distribution)
● Can be linear, polynomial, nonlinear,
etc.
Regression
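
For the linear case, the estimated function is a straight line fitted by ordinary least squares. The sketch below uses plain Python; the `fit_line` helper and the revenue numbers are hypothetical and only illustrate the idea.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: revenue (arbitrary units) per year index.
years = [1, 2, 3, 4, 5]
revenue = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(years, revenue)
forecast = slope * 6 + intercept  # extrapolate to the next year
print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.2f}")
```

In practice the same fit is usually done with NumPy or scikit-learn rather than by hand.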

● Training set of classified examples
(supervised learning)
● Test set of non-classified items
● Main goal: find a function (classifier) that
maps input data to a category
● Computer vision, drug discovery, speech
recognition, biometric identification,
credit scoring
Classification

● There are two classes of objects A & B
● Define the class of new object, based on
information about its neighbors
● By changing the boundary of the area around the
new object, we form a set of neighbors.
● The new object is B because the majority of its
neighbors are B.
kNN (k-nearest neighbor)
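
The majority vote described above fits in a few lines of plain Python. The `knn_predict` helper and the sample points are hypothetical; Euclidean distance is an assumption, since the slide does not name a metric.

```python
import math
from collections import Counter


def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors.

    train is a list of (point, label) pairs, where point is a tuple of floats.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two classes of objects, A and B (made-up coordinates).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 6.5), "B"),
]

new_object = (4.0, 4.0)
label_k3 = knn_predict(train, new_object, k=3)  # small neighborhood
label_k5 = knn_predict(train, new_object, k=5)  # wider neighborhood
print(label_k3, label_k5)
```

Changing `k` plays the role of changing the boundary of the neighborhood: a larger `k` pulls in more distant neighbors before the vote.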

● It’s free
● Not a fully implemented stack of ML
algorithms
● All your matrix are belong to us!
● Single-threaded model
● Hard to integrate with anything else
Why not Octave?

● Old ancient community
● Syntax is too sweet
● You should read 1000 lines in docs to
write 1 line of code
● Single-threaded model for 95% of
algorithms
● Not good for your production
Why not R?

● Now Python is an idol for young scientists
due to the low barrier to entry
● Python is a glue
● It solves the two-language problem
● Have you ever heard about Jython?
● Not such a long way to real high-load
production
Why not Python?

● A fast and efficient multidimensional
array object ndarray
● Tools for reading and writing array-based
data sets to disk
● Linear algebra operations, Fourier
transform
● Tools for integrating C, C++, and Fortran code
Son of the Fortran : NumPy
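
A short sketch of three of the features listed above: the `ndarray`, a linear algebra call, a Fourier transform, and saving an array. The numbers are arbitrary, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# A two-dimensional ndarray and a one-dimensional right-hand side.
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Linear algebra: solve the system a @ x = b.
x = np.linalg.solve(a, b)

# Fourier transform of a short signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# Write and read an array; a file path works the same way as this buffer.
buffer = io.BytesIO()
np.save(buffer, a)
buffer.seek(0)
restored = np.load(buffer)

print(x, spectrum, restored, sep="\n")
```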

● Sparse matrix, signal processing tools,
optimization, differential equation
● Integration of old Fortran libraries
● Tool for using inline C++ code to
accelerate array computations
● Powerful stat and discrete probability
distributions
SciPy as a servant
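
A minimal sketch touching three of the areas above (sparse matrices, optimization, discrete distributions). The matrix, the objective function and the coin-flip numbers are made up for the example.

```python
from scipy import optimize, sparse, stats

# Sparse matrix: store only the non-zero entries of a 3x3 diagonal matrix.
diag = sparse.diags([1.0, 2.0, 3.0]).tocsr()

# Optimization: minimize (x - 2)^2 + 1, whose minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Discrete distribution: P(exactly 2 heads in 4 fair coin flips).
p_two_heads = stats.binom.pmf(2, 4, 0.5)

print(diag.nnz, result.x, p_two_heads)
```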

● The DataFrame, a two-dimensional tabular, column-oriented
data structure with both row and column labels
● It is easy to reshape, slice and dice, perform aggregations, and
select subsets (EASY JOIN AND GROUP BY IN CODE!!!!)
● High-performance time series functionality and tools
well-suited for working with financial data
● Read and write CSV, JSON, HTML, XML, HDF5 and special
binary formats
Pandas : what’s new?
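
The "join and group by in code" point can be sketched as follows. The two tables, their column names and the values are hypothetical.

```python
import pandas as pd

# Made-up tables for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 30],
    "amount": [50.0, 70.0, 20.0, 90.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20, 30],
    "region": ["north", "north", "south"],
})

# JOIN: attach the region of each customer to every order.
joined = orders.merge(customers, on="customer_id", how="left")

# GROUP BY: total order amount per region.
per_region = joined.groupby("region")["amount"].sum()

# Slice and dice: select the subset of large orders.
large = joined[joined["amount"] > 60.0]

print(per_region)
print(large)
```

The SQL equivalent would be a `LEFT JOIN` followed by `GROUP BY region`; here both steps stay in the same Python session as the rest of the analysis.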

All we need to draw (Matplotlib)

● Classification, regression, clustering,
decision trees
● Comparing, validating and choosing
parameters and models
● Feature extraction and normalization
● Encoding categorical features
● Dimensionality Reduction
Scikit-learn
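
A sketch of how normalization, classification and parameter selection fit together in scikit-learn. The choice of the bundled iris dataset, a k-nearest-neighbors classifier and the candidate values of k are assumptions for the example, not something the slides prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Normalization followed by a classifier, bundled into one estimator.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Compare candidate values of k with 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
best_score = search.best_score_  # mean cross-validated accuracy
print(best_k, round(best_score, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the validation folds do not leak into the normalization.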

How to start mining your data

● MapReduce in memory
● Up to 50x faster than Hadoop
● Support for Shark (like Hive), MLlib
(Machine learning), GraphX (graph
processing)
● RDD is a basic building block (immutable
distributed collections of objects)
Spark
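
The map and reduce steps themselves can be shown in plain Python. This is not the Spark API, only a single-machine stand-in under the assumption that a list plays the role of a collection; in Spark the collection would be an RDD split across a cluster.

```python
from collections import Counter
from functools import reduce

lines = ["give my data", "my data my rules"]

# Map step: turn each line into its own word counts (a new collection,
# the input list is left untouched).
mapped = [Counter(line.split()) for line in lines]

# Reduce step: merge the partial counts pairwise into one result.
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

print(word_counts)
```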

● Python is dynamically typed, so RDDs can hold objects of
multiple types
● PySpark does not yet support a few API calls, such as lookup
and non-text input files
● PySpark requires Python 2.6 or higher (not tested with Python 3)
● PySpark applications are executed using a standard CPython
interpreter
● MLlib is supported, GraphX is not supported
PySpark (Python API for Spark)

● Python language data structures for
graphs, digraphs, and multigraphs
● Generators for classic graphs, random
graphs, and synthetic networks
● Analyzing: clustering, independent set,
closeness, betweenness, connectivity,
transitivity, flows, isomorphism,
PageRank
NetworkX
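
A small sketch of a few of the measures listed above, on a made-up four-node graph (a triangle with one extra edge).

```python
import networkx as nx

# An undirected graph: a triangle (a, b, c) with a tail c - d.
graph = nx.Graph()
graph.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])

connected = nx.is_connected(graph)
transitivity = nx.transitivity(graph)
closeness = nx.closeness_centrality(graph)
betweenness = nx.betweenness_centrality(graph)

# A generator for a classic graph: the complete graph on 5 nodes.
k5 = nx.complete_graph(5)

print(connected, transitivity, closeness["c"], betweenness["c"])
print(k5.number_of_edges())
```

Node `c` sits on every path to `d`, which is why it gets the highest closeness and betweenness in this toy graph.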

Graph                    | Number of vertices | Number of edges | Volume | Data per day
Web graph                | 1.5 * 10^12        | 1.2 * 10^13     | 100 PB | 300 TB
Facebook (friends graph) | 1.1 * 10^9         | 160 * 10^9      | 1 PB   | 15 TB
Road graph of EU         | 18 * 10^6          | 42 * 10^6       | 20 GB  | 50 MB
Road graph of this city  | 250 000            | 460 000         | 500 MB | 100 KB

● Leaderboard
● How accurate does my model really need to
be? What data can I use? (business
problem)
● Good clear data for unbelievable situations
● Fun & enjoyment building model
● Kaggle is too far from your production
Machine learning isn't Kaggle competitions

Size            | Classification                              | Tools
Lines           | Sample Data: Analysis and Visualization     | Whiteboard, bash
KBs - low MBs   | Prototype Data: Analysis and Visualization  | Matlab, Octave, R
MBs - low GBs   | Online Data: Storage                        | MySQL (DBs)
MBs - low GBs   | Online Data: Analysis                       | NumPy, SciPy, Weka, BLAS/LAPACK
GBs - TBs - PBs | Big Data: Storage                           | HDFS, HBase, Cassandra
GBs - TBs - PBs | Big Data: Analysis                          | Hive, Mahout, Hama, Giraph, MLlib

Mailing lists, Google Groups: pydata, pystatsmodels,
numpy-discussion, scipy-user
Conferences: PyCon, EuroPython and its friends SciPy and
EuroSciPy
Books: Python for Data Analysis, Mastering Machine Learning with
scikit-learn
Community, conferences and books
