SlideShare a Scribd company logo
Speaker : Alexey Zinoviev
Python's slippy path and Tao of thick
Pandas: give my data, Rrrrr...
About
● I am a <graph theory, machine learning, traffic jams prediction,
BigData algorythms> scientist
● But I'm a <Java, Scala, Python, NoSQL, Hadoop, Spark>
programmer
One of these fine days...
We need in R dev 'cause Data Mining
You're a programmer, not an analyst
Write your code!
Let’s talk about it, Python-boy...
Data mining
Mining coal
in your data
Hey, man, predict me something!
Man or sofa?
● Which loan applicants are high-risk?
● How do we detect phone card fraud?
● Which customers do prefer product A over product B?
● What is the revenue prediction for next year?
Typical questions for DM
What is Data Mining?
12/68
Statistics?
Tag cloud?
Data visualization?
Not OLAP, 100%
1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
Magic part of KDD (Knowledge
Discovery in Databases)
1. Share your date with us
2. Our magic manipulations
3. Building an answering machine
4. PROFIT!!!
How it really works
Data
19/68
● Facebook users, tweets
● Weather
● Sea routes
● Trade transactions
● Goverment
● Medicine (genomic data)
● Telecommuncations (phone call records)
Data examples
● Relational Databases
● Data warehouses (Historical data)
● Files in CSV or in binary format
● Internet or electronic mails
● Scientific, research (R, Octave,
Matlab)
Data sources
Target Data & Personal Data
22/68
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
● All your personal data (PD) are being
deeply mined
● The industry of collecting, aggregating,
and brokering PD is “database marketing.”
● 1.1 billion browser cookies, 200 million
mobile profiles, and an average of 1,500
pieces of data per consumer in Acxiom
Pay with your personal data
Preprocessing
25/68
● Select small pieces
● Define default values for missed data
● Remove strange signals from data
● Merge some tables in one if required
Pattern mining
27/68
Association rule learning
It is the process of finding model of function that describes and
distinguishes data class to predict the class of objects whose class
label is unknown.
What is Cluster Analysis?
● Statistical process for estimating the
relationships among variables
● The estimation target is function (it can
be probability distribution)
● Can be linear, polynomial, nonlinear and
etc.
Regression
● Training set of classified examples
(supervised learning)
● Test set of non-classified items
● Main goal: find a function (classifier) that
maps input data to a category
● Computer vision, drug discovery, speech
recognition, biometric indentification,
credit scoring
Classification
Decision trees
Decision trees
● There are two classes of objects A & B
● Define the class of new object, based on
information about its neighbors
● Changing the boundaries of an new object
area, we form a set of neighbors.
● New object is B becuase majority of the
neighbors is a B.
kNN (k-nearest neighbor)
Skills & Tools
35/68
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Fashion Languages
38/68
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
● It’s free
● Not full implemented stack of ML
algorythms
● All your matrix are belong to us!
● Single thread model
● Hard way to integrate something
Why not Octave?
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
● Old ancient community
● Syntax is too sweet
● You should read 1000 lines in docs to
write 1 line of code
● Single thread model for 95%
algorythms
● Not good for your production
Why not R?
● Now Python is an idol for young scientists
due to the low barrier to entry
● Python is a glue
● It’s solving two-language problem
● Have you ever heard about a Jython?
● Not so long way to real Highload
production
Why not Python?
Frameworks
44/68
● A fast and efficient multidimensional
array object ndarray
● Tools for reading and writing array-based
data sets to disk
● Linear algebra operations, Fourier
transform
● Tools for integration C, C++, and Fortran
Son of the Fortran : NumPy
● Sparse matrix, signal processing tools,
optimization, differential equation
● Integration of old Fortran libraries
● Tool for using inline C++ code to
accelerate array computations
● Powerful stat and discrete probability
distributions
SciPy as a servant
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
● The DataFrame, a twodimensional tabular, column-oriented
data structure with both row and column labels
● It is easy to reshape, slice and dice, perform aggregations, and
select subsets (EASY JOIN AND GROUP BY IN CODE!!!!)
● High-performance time series functionality and tools well-
suited for working with financial data
● Read and write CSV, JSON, HTML, XML, HDF5 and special
binary formats
Pandas : what’s new?
All we need in draw (Matplotlib)
● The DataFrame, a twodimensional tabular, column-oriented
data structure with both row and column labels
● It is easy to reshape, slice and dice, perform aggregations, and
select subsets (EASY JOIN AND GROUP BY IN CODE!!!!)
● High-performance time series functionality and tools well-
suited for working with financial data
● Read and write CSV, JSON, HTML, XML, HDF5 and special
binary formats
Pandas : what’s new?
● Classification, regression, clustering,
decision trees
● Comparing, validating and choosing
parameters and models
● Feature extraction and normalization
● Encoding categorical features
● Dimensionality Reduction
Scikit-learn
How to start mining your data
52/68
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Weka + Hadoop
PySpark
56/68
● MapReduce in memory
● Up to 50x faster than Hadoop
● Support for Shark (like Hive), MLlib
(Machine learning), GraphX (graph
processing)
● RDD is a basic building block (immutable
distributed collections of objects)
Spark
Spark
● Python is dynamically typed, so RDDs can hold objects of
multiple types
● PySpark does not yet support a few API calls, such as lookup
and non-text input files
● PySpark requires Python 2.6 or higher (not tested with P3)
● PySpark applications are executed using a standard CPython
interpreter
● MLlib is supported, GraphX is not supported
PySpark (Python API for Spark)
Graph processing tools
60/68
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
● Python language data structures for
graphs, digraphs, and multigraphs
● Generators for classic graphs, random
graphs, and synthetic networks
● Analyzing : clustering, independent set,
closeness, betweeness, connectivity,
transitivity, flows, isomorphism, Page
Rank
NetworkX
Graph Number of
vertexes
Number of
edges
Volume Data/per day
Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB
Facebook
(friends
graph)
1,1 * 10^9 160 * 10^9 1 PB 15 TB
Road graph
of EU
18 * 10^6 42 * 10^6 20 GB 50 MB
Road graph
of this city
250 000 460 000 500 MB 100 KB
● Leaderboard
● How accurate does my model really need to
be? What data can I use? (bussiness
problem)
● Good clear data for unbelievable situations
● Fun & enjoyment building model
● Kaggle is too far from your production
Machine learning isn't Kaggle competitions
Size Classification Tools
Lines
Sample Data
Analysis and Visualization Whiteboard, bash
KBs - low MBs
Prototype Data
Analysis and Visualization Matlab, Octave, R
MBs - low GBs
Online Data
Storage MySQL (DBs)
MBs - low GBs
Online Data
Analysis NumPy, SciPy, Weka, BLAS/LAPACK
GBs - TBs - PBs
BigData
Storage HDFS, HBase, Cassandra
GBs - TBs - PBs
Big Data
Analysis Hive, Mahout, Hama, Giraph,MLlib
Mailing lists, google groups: pydata, pystatsmodels, numpy-
discussion, scipy-user
Conferences: PyCon, EuroPython and its friends SciPy and
EuroSciPy
Books: Python for Data Analysis, Mastering Machine Learning with
scikit-learn
Community, conferences and books
Your questions?

More Related Content

What's hot

Data Processing over very Large Relational Databases
Data Processing over very Large Relational DatabasesData Processing over very Large Relational Databases
Data Processing over very Large Relational Databases
kvaderlipa
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
National Institute of Informatics
 
Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
Dimitris Kontokostas
 
PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...
akira-ai
 
Querying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisQuerying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysis
Peter Bouda
 
Semantic Web Technology
Semantic Web TechnologySemantic Web Technology
Semantic Web Technology
Rathachai Chawuthai
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
Vrije Universiteit Amsterdam
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con R
GraphRM
 
Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020
Ontotext
 
Why is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncWhy is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz Inc
Franz Inc. - AllegroGraph
 
JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
Franz Inc. - AllegroGraph
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge Graph
GraphAware
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Harsh Thakkar
 
HyperGraphQL
HyperGraphQLHyperGraphQL
HyperGraphQL
Szymon Klarman
 
Regal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic DataRegal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic Data
Felix Ostrowski
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
Jerven Bolleman
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
Girish Khanzode
 
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Harsh Thakkar
 
Essentials of R
Essentials of REssentials of R
Essentials of R
ExternalEvents
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
Ontotext
 

What's hot (20)

Data Processing over very Large Relational Databases
Data Processing over very Large Relational DatabasesData Processing over very Large Relational Databases
Data Processing over very Large Relational Databases
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
 
PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...
 
Querying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisQuerying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysis
 
Semantic Web Technology
Semantic Web TechnologySemantic Web Technology
Semantic Web Technology
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con R
 
Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020
 
Why is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncWhy is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz Inc
 
JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge Graph
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
 
HyperGraphQL
HyperGraphQLHyperGraphQL
HyperGraphQL
 
Regal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic DataRegal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic Data
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
 
Essentials of R
Essentials of REssentials of R
Essentials of R
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 

Similar to Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...

Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
Alexey Zinoviev
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining Kindergarten
Alexey Zinoviev
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dan Lynn
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
Dan Lynn
 
Data science
Data scienceData science
Data science
Purna Chander
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
BigML, Inc
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
Pramod Toraskar
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
Sanket Shikhar
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and Spark
Felix Crisan
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
Petr Zapletal
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
datamantra
 
LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
Korkrid Akepanidtaworn
 
L15.pptx
L15.pptxL15.pptx
L15.pptx
ImonBennett
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
TigerGraph
 
Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guide
priyanka rajput
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DS
Roopesh Kohad
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
Konstantinos Xirogiannopoulos
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
PyData
 
Big Data Science in Scala V2
Big Data Science in Scala V2 Big Data Science in Scala V2
Big Data Science in Scala V2
Anastasia Bobyreva
 
Python and data analytics
Python and data analyticsPython and data analytics

Similar to Python's slippy path and Tao of thick Pandas: give my data, Rrrrr... (20)

Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining Kindergarten
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Data science
Data scienceData science
Data science
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and Spark
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
 
L15.pptx
L15.pptxL15.pptx
L15.pptx
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guide
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DS
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
Big Data Science in Scala V2
Big Data Science in Scala V2 Big Data Science in Scala V2
Big Data Science in Scala V2
 
Python and data analytics
Python and data analyticsPython and data analytics
Python and data analytics
 

More from Alexey Zinoviev

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolves
Alexey Zinoviev
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
Alexey Zinoviev
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Alexey Zinoviev
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
Alexey Zinoviev
 
HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...
Alexey Zinoviev
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
Alexey Zinoviev
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projects
Alexey Zinoviev
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDB
Alexey Zinoviev
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
Alexey Zinoviev
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Alexey Zinoviev
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
Alexey Zinoviev
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Alexey Zinoviev
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8
Alexey Zinoviev
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphs
Alexey Zinoviev
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"
Alexey Zinoviev
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Alexey Zinoviev
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерности
Alexey Zinoviev
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
Alexey Zinoviev
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!
Alexey Zinoviev
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOS
Alexey Zinoviev
 

More from Alexey Zinoviev (20)

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolves
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
 
HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projects
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDB
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find you
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphs
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерности
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOS
 

Recently uploaded

ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
MinThetLwin1
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
huseindihon
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
Joel Ngushwai
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
huseindihon
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
birajmohan012
 
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
44annissa
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
zyqedad
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
vrvipin164
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
uapta
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
Kanchana Weerasinghe
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
bhupeshkumar0889
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
doctorzlife786
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
LINAT
 
M44.pdf dairy management farm report of an
M44.pdf dairy management farm report of anM44.pdf dairy management farm report of an
M44.pdf dairy management farm report of an
ManjuBv2
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
tanupasswan6
 
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
janvikumar4133
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
kevig
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
NABLAS株式会社
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
sheetal singh$A17
 
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
revolutionary575
 

Recently uploaded (20)

ch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ssch8_multiplexing cs553 st07 slide share ss
ch8_multiplexing cs553 st07 slide share ss
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
 
Biometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdfBiometric Question Bank 2021 - 1 Soln-1.pdf
Biometric Question Bank 2021 - 1 Soln-1.pdf
 
the unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithmthe unexpected potential of Dijkstra's Algorithm
the unexpected potential of Dijkstra's Algorithm
 
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
Beautiful Girls Call Pune 000XX00000 Provide Best And Top Girl Service And No...
 
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
VIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 in...
 
Nipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma TranscriptNipissing University degree offer Nipissing diploma Transcript
Nipissing University degree offer Nipissing diploma Transcript
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
 
Research proposal seminar ,Research Methodology
Research proposal seminar ,Research MethodologyResearch proposal seminar ,Research Methodology
Research proposal seminar ,Research Methodology
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
M44.pdf dairy management farm report of an
M44.pdf dairy management farm report of anM44.pdf dairy management farm report of an
M44.pdf dairy management farm report of an
 
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
New Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And N...
 
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
 
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
社内勉強会資料_TransNeXt: Robust Foveal Visual Perception for Vision Transformers
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
Verified Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servic...
 

Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...

  • 1. Speaker : Alexey Zinoviev Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
  • 2. About ● I am a <graph theory, machine learning, traffic jams prediction, BigData algorythms> scientist ● But I'm a <Java, Scala, Python, NoSQL, Hadoop, Spark> programmer
  • 3. One of these fine days...
  • 4. We need in R dev 'cause Data Mining
  • 5. You're a programmer, not an analyst
  • 7. Let’s talk about it, Python-boy...
  • 9. Hey, man, predict me something!
  • 11. ● Which loan applicants are high-risk? ● How do we detect phone card fraud? ● Which customers do prefer product A over product B? ● What is the revenue prediction for next year? Typical questions for DM
  • 12. What is Data Mining? 12/68
  • 17. 1. Selection 2. Pre-processing 3. Transformation 4. Data Mining 5. Interpretation/Evaluation Magic part of KDD (Knowledge Discovery in Databases)
  • 18. 1. Share your date with us 2. Our magic manipulations 3. Building an answering machine 4. PROFIT!!! How it really works
  • 20. ● Facebook users, tweets ● Weather ● Sea routes ● Trade transactions ● Goverment ● Medicine (genomic data) ● Telecommuncations (phone call records) Data examples
  • 21. ● Relational Databases ● Data warehouses (Historical data) ● Files in CSV or in binary format ● Internet or electronic mails ● Scientific, research (R, Octave, Matlab) Data sources
  • 22. Target Data & Personal Data 22/68
  • 24. ● All your personal data (PD) are being deeply mined ● The industry of collecting, aggregating, and brokering PD is “database marketing.” ● 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer in Acxiom Pay with your personal data
  • 26. ● Select small pieces ● Define default values for missed data ● Remove strange signals from data ● Merge some tables in one if required
  • 29. It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown. What is Cluster Analysis?
  • 30. ● Statistical process for estimating the relationships among variables ● The estimation target is function (it can be probability distribution) ● Can be linear, polynomial, nonlinear and etc. Regression
  • 31. ● Training set of classified examples (supervised learning) ● Test set of non-classified items ● Main goal: find a function (classifier) that maps input data to a category ● Computer vision, drug discovery, speech recognition, biometric indentification, credit scoring Classification
  • 34. ● There are two classes of objects A & B ● Define the class of new object, based on information about its neighbors ● Changing the boundaries of an new object area, we form a set of neighbors. ● New object is B becuase majority of the neighbors is a B. kNN (k-nearest neighbor)
  • 40. ● It’s free ● Not full implemented stack of ML algorythms ● All your matrix are belong to us! ● Single thread model ● Hard way to integrate something Why not Octave?
  • 42. ● Old ancient community ● Syntax is too sweet ● You should read 1000 lines in docs to write 1 line of code ● Single thread model for 95% algorythms ● Not good for your production Why not R?
  • 43. ● Now Python is an idol for young scientists due to the low barrier to entry ● Python is a glue ● It’s solving two-language problem ● Have you ever heard about a Jython? ● Not so long way to real Highload production Why not Python?
  • 45. ● A fast and efficient multidimensional array object ndarray ● Tools for reading and writing array-based data sets to disk ● Linear algebra operations, Fourier transform ● Tools for integration C, C++, and Fortran Son of the Fortran : NumPy
  • 46. ● Sparse matrix, signal processing tools, optimization, differential equation ● Integration of old Fortran libraries ● Tool for using inline C++ code to accelerate array computations ● Powerful stat and discrete probability distributions SciPy as a servant
  • 48. ● The DataFrame, a twodimensional tabular, column-oriented data structure with both row and column labels ● It is easy to reshape, slice and dice, perform aggregations, and select subsets (EASY JOIN AND GROUP BY IN CODE!!!!) ● High-performance time series functionality and tools well- suited for working with financial data ● Read and write CSV, JSON, HTML, XML, HDF5 and special binary formats Pandas : what’s new?
  • 49. All we need in draw (Matplotlib)
  • 50. ● The DataFrame, a twodimensional tabular, column-oriented data structure with both row and column labels ● It is easy to reshape, slice and dice, perform aggregations, and select subsets (EASY JOIN AND GROUP BY IN CODE!!!!) ● High-performance time series functionality and tools well- suited for working with financial data ● Read and write CSV, JSON, HTML, XML, HDF5 and special binary formats Pandas : what’s new?
  • 51. ● Classification, regression, clustering, decision trees ● Comparing, validating and choosing parameters and models ● Feature extraction and normalization ● Encoding categorical features ● Dimensionality Reduction Scikit-learn
  • 52. How to start mining your data 52/68
  • 57. ● MapReduce in memory ● Up to 50x faster than Hadoop ● Support for Shark (like Hive), MLlib (Machine learning), GraphX (graph processing) ● RDD is a basic building block (immutable distributed collections of objects) Spark
  • 58. Spark
  • 59. ● Python is dynamically typed, so RDDs can hold objects of multiple types ● PySpark does not yet support a few API calls, such as lookup and non-text input files ● PySpark requires Python 2.6 or higher (not tested with P3) ● PySpark applications are executed using a standard CPython interpreter ● MLlib is supported, GraphX is not supported PySpark (Python API for Spark)
  • 63. ● Python language data structures for graphs, digraphs, and multigraphs ● Generators for classic graphs, random graphs, and synthetic networks ● Analyzing : clustering, independent set, closeness, betweeness, connectivity, transitivity, flows, isomorphism, Page Rank NetworkX
  • 64. Graph Number of vertexes Number of edges Volume Data/per day Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB Facebook (friends graph) 1,1 * 10^9 160 * 10^9 1 PB 15 TB Road graph of EU 18 * 10^6 42 * 10^6 20 GB 50 MB Road graph of this city 250 000 460 000 500 MB 100 KB
  • 65. ● Leaderboard ● How accurate does my model really need to be? What data can I use? (bussiness problem) ● Good clear data for unbelievable situations ● Fun & enjoyment building model ● Kaggle is too far from your production Machine learning isn't Kaggle competitions
  • 66. Size Classification Tools Lines Sample Data Analysis and Visualization Whiteboard, bash KBs - low MBs Prototype Data Analysis and Visualization Matlab, Octave, R MBs - low GBs Online Data Storage MySQL (DBs) MBs - low GBs Online Data Analysis NumPy, SciPy, Weka, BLAS/LAPACK GBs - TBs - PBs BigData Storage HDFS, HBase, Cassandra GBs - TBs - PBs Big Data Analysis Hive, Mahout, Hama, Giraph,MLlib
  • 67. Mailing lists, google groups: pydata, pystatsmodels, numpy- discussion, scipy-user Conferences: PyCon, EuroPython and its friends SciPy and EuroSciPy Books: Python for Data Analysis, Mastering Machine Learning with scikit-learn Community, conferences and books