SlideShare a Scribd company logo
1 of 68
Speaker : Alexey Zinoviev
Python's slippy path and Tao of thick
Pandas: give my data, Rrrrr...
About
● I am a <graph theory, machine learning, traffic jams prediction,
BigData algorythms> scientist
● But I'm a <Java, Scala, Python, NoSQL, Hadoop, Spark>
programmer
One of these fine days...
We need in R dev 'cause Data Mining
You're a programmer, not an analyst
Write your code!
Let’s talk about it, Python-boy...
Data mining
Mining coal
in your data
Hey, man, predict me something!
Man or sofa?
● Which loan applicants are high-risk?
● How do we detect phone card fraud?
● Which customers do prefer product A over product B?
● What is the revenue prediction for next year?
Typical questions for DM
What is Data Mining?
12/68
Statistics?
Tag cloud?
Data visualization?
Not OLAP, 100%
1. Selection
2. Pre-processing
3. Transformation
4. Data Mining
5. Interpretation/Evaluation
Magic part of KDD (Knowledge
Discovery in Databases)
1. Share your date with us
2. Our magic manipulations
3. Building an answering machine
4. PROFIT!!!
How it really works
Data
19/68
● Facebook users, tweets
● Weather
● Sea routes
● Trade transactions
● Goverment
● Medicine (genomic data)
● Telecommuncations (phone call records)
Data examples
● Relational Databases
● Data warehouses (Historical data)
● Files in CSV or in binary format
● Internet or electronic mails
● Scientific, research (R, Octave,
Matlab)
Data sources
Target Data & Personal Data
22/68
● All your personal data (PD) are being
deeply mined
● The industry of collecting, aggregating,
and brokering PD is “database marketing.”
● 1.1 billion browser cookies, 200 million
mobile profiles, and an average of 1,500
pieces of data per consumer in Acxiom
Pay with your personal data
Preprocessing
25/68
● Select small pieces
● Define default values for missed data
● Remove strange signals from data
● Merge some tables in one if required
Pattern mining
27/68
Association rule learning
It is the process of finding model of function that describes and
distinguishes data class to predict the class of objects whose class
label is unknown.
What is Cluster Analysis?
● Statistical process for estimating the
relationships among variables
● The estimation target is function (it can
be probability distribution)
● Can be linear, polynomial, nonlinear and
etc.
Regression
● Training set of classified examples
(supervised learning)
● Test set of non-classified items
● Main goal: find a function (classifier) that
maps input data to a category
● Computer vision, drug discovery, speech
recognition, biometric indentification,
credit scoring
Classification
Decision trees
Decision trees
● There are two classes of objects A & B
● Define the class of new object, based on
information about its neighbors
● Changing the boundaries of an new object
area, we form a set of neighbors.
● New object is B becuase majority of the
neighbors is a B.
kNN (k-nearest neighbor)
Skills & Tools
35/68
Fashion Languages
38/68
● It’s free
● Not full implemented stack of ML
algorythms
● All your matrix are belong to us!
● Single thread model
● Hard way to integrate something
Why not Octave?
● Old ancient community
● Syntax is too sweet
● You should read 1000 lines in docs to
write 1 line of code
● Single thread model for 95%
algorythms
● Not good for your production
Why not R?
● Now Python is an idol for young scientists
due to the low barrier to entry
● Python is a glue
● It’s solving two-language problem
● Have you ever heard about a Jython?
● Not so long way to real Highload
production
Why not Python?
Frameworks
44/68
● A fast and efficient multidimensional
array object ndarray
● Tools for reading and writing array-based
data sets to disk
● Linear algebra operations, Fourier
transform
● Tools for integration C, C++, and Fortran
Son of the Fortran : NumPy
● Sparse matrix, signal processing tools,
optimization, differential equation
● Integration of old Fortran libraries
● Tool for using inline C++ code to
accelerate array computations
● Powerful stat and discrete probability
distributions
SciPy as a servant
● The DataFrame, a twodimensional tabular, column-oriented
data structure with both row and column labels
● It is easy to reshape, slice and dice, perform aggregations, and
select subsets (EASY JOIN AND GROUP BY IN CODE!!!!)
● High-performance time series functionality and tools well-
suited for working with financial data
● Read and write CSV, JSON, HTML, XML, HDF5 and special
binary formats
Pandas : what’s new?
All we need in draw (Matplotlib)
● The DataFrame, a twodimensional tabular, column-oriented
data structure with both row and column labels
● It is easy to reshape, slice and dice, perform aggregations, and
select subsets (EASY JOIN AND GROUP BY IN CODE!!!!)
● High-performance time series functionality and tools well-
suited for working with financial data
● Read and write CSV, JSON, HTML, XML, HDF5 and special
binary formats
Pandas : what’s new?
● Classification, regression, clustering,
decision trees
● Comparing, validating and choosing
parameters and models
● Feature extraction and normalization
● Encoding categorical features
● Dimensionality Reduction
Scikit-learn
How to start mining your data
52/68
Weka + Hadoop
PySpark
56/68
● MapReduce in memory
● Up to 50x faster than Hadoop
● Support for Shark (like Hive), MLlib
(Machine learning), GraphX (graph
processing)
● RDD is a basic building block (immutable
distributed collections of objects)
Spark
Spark
● Python is dynamically typed, so RDDs can hold objects of
multiple types
● PySpark does not yet support a few API calls, such as lookup
and non-text input files
● PySpark requires Python 2.6 or higher (not tested with P3)
● PySpark applications are executed using a standard CPython
interpreter
● MLlib is supported, GraphX is not supported
PySpark (Python API for Spark)
Graph processing tools
60/68
● Python language data structures for
graphs, digraphs, and multigraphs
● Generators for classic graphs, random
graphs, and synthetic networks
● Analyzing : clustering, independent set,
closeness, betweeness, connectivity,
transitivity, flows, isomorphism, Page
Rank
NetworkX
Graph Number of
vertexes
Number of
edges
Volume Data/per day
Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB
Facebook
(friends
graph)
1,1 * 10^9 160 * 10^9 1 PB 15 TB
Road graph
of EU
18 * 10^6 42 * 10^6 20 GB 50 MB
Road graph
of this city
250 000 460 000 500 MB 100 KB
● Leaderboard
● How accurate does my model really need to
be? What data can I use? (bussiness
problem)
● Good clear data for unbelievable situations
● Fun & enjoyment building model
● Kaggle is too far from your production
Machine learning isn't Kaggle competitions
Size Classification Tools
Lines
Sample Data
Analysis and Visualization Whiteboard, bash
KBs - low MBs
Prototype Data
Analysis and Visualization Matlab, Octave, R
MBs - low GBs
Online Data
Storage MySQL (DBs)
MBs - low GBs
Online Data
Analysis NumPy, SciPy, Weka, BLAS/LAPACK
GBs - TBs - PBs
BigData
Storage HDFS, HBase, Cassandra
GBs - TBs - PBs
Big Data
Analysis Hive, Mahout, Hama, Giraph,MLlib
Mailing lists, google groups: pydata, pystatsmodels, numpy-
discussion, scipy-user
Conferences: PyCon, EuroPython and its friends SciPy and
EuroSciPy
Books: Python for Data Analysis, Mastering Machine Learning with
scikit-learn
Community, conferences and books
Your questions?

More Related Content

What's hot

Data Processing over very Large Relational Databases
Data Processing over very Large Relational DatabasesData Processing over very Large Relational Databases
Data Processing over very Large Relational Databaseskvaderlipa
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeNational Institute of Informatics
 
PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...akira-ai
 
Querying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisQuerying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisPeter Bouda
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RGraphRM
 
Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Ontotext
 
Why is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncWhy is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncFranz Inc. - AllegroGraph
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphGraphAware
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfHarsh Thakkar
 
Regal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic DataRegal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic DataFelix Ostrowski
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLJerven Bolleman
 
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSHarsh Thakkar
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudOntotext
 

What's hot (20)

Data Processing over very Large Relational Databases
Data Processing over very Large Relational DatabasesData Processing over very Large Relational Databases
Data Processing over very Large Relational Databases
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
Data quality in Real Estate
Data quality in Real EstateData quality in Real Estate
Data quality in Real Estate
 
PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...PPT3: Main algorithms and techniques required for implementing Machine Learni...
PPT3: Main algorithms and techniques required for implementing Machine Learni...
 
Querying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysisQuerying GrAF data in linguistic analysis
Querying GrAF data in linguistic analysis
 
Semantic Web Technology
Semantic Web TechnologySemantic Web Technology
Semantic Web Technology
 
LD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and toolsLD4KD 2015 - Demos and tools
LD4KD 2015 - Demos and tools
 
aRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con RaRangodb, un package per l'utilizzo di ArangoDB con R
aRangodb, un package per l'utilizzo di ArangoDB con R
 
Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020Property graph vs. RDF Triplestore comparison in 2020
Property graph vs. RDF Triplestore comparison in 2020
 
Why is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncWhy is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz Inc
 
JSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge GraphsJSON-LD and SHACL for Knowledge Graphs
JSON-LD and SHACL for Knowledge Graphs
 
Social media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge GraphSocial media monitoring with ML-powered Knowledge Graph
Social media monitoring with ML-powered Knowledge Graph
 
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sfSparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
Sparql querying of-property-graphs-harsh thakkar-graph day 2017 sf
 
HyperGraphQL
HyperGraphQLHyperGraphQL
HyperGraphQL
 
Regal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic DataRegal - a Repository for Electronic Documents and Bibliographic Data
Regal - a Repository for Electronic Documents and Bibliographic Data
 
Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUSSemantics 2017 - Trying Not to Die Benchmarking using LITMUS
Semantics 2017 - Trying Not to Die Benchmarking using LITMUS
 
Essentials of R
Essentials of REssentials of R
Essentials of R
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 

Similar to Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...

Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistAlexey Zinoviev
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining KindergartenAlexey Zinoviev
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBigML, Inc
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and SparkFelix Crisan
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...TigerGraph
 
Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guidepriyanka rajput
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DSRoopesh Kohad
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesKonstantinos Xirogiannopoulos
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesPyData
 

Similar to Python's slippy path and Tao of thick Pandas: give my data, Rrrrr... (20)

Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
 
First steps in Data Mining Kindergarten
First steps in Data Mining KindergartenFirst steps in Data Mining Kindergarten
First steps in Data Mining Kindergarten
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Data science
Data scienceData science
Data science
 
BSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 SessionsBSSML16 L10. Summary Day 2 Sessions
BSSML16 L10. Summary Day 2 Sessions
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Data analysis with Pandas and Spark
Data analysis with Pandas and SparkData analysis with Pandas and Spark
Data analysis with Pandas and Spark
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
 
L15.pptx
L15.pptxL15.pptx
L15.pptx
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
Python for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive GuidePython for Data Science: A Comprehensive Guide
Python for Data Science: A Comprehensive Guide
 
General introduction to AI ML DL DS
General introduction to AI ML DL DSGeneral introduction to AI ML DL DS
General introduction to AI ML DL DS
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
Big Data Science in Scala V2
Big Data Science in Scala V2 Big Data Science in Scala V2
Big Data Science in Scala V2
 
Python and data analytics
Python and data analyticsPython and data analytics
Python and data analytics
 

More from Alexey Zinoviev

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolvesAlexey Zinoviev
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Alexey Zinoviev
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev
 
HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...Alexey Zinoviev
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Alexey Zinoviev
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsAlexey Zinoviev
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBAlexey Zinoviev
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...Alexey Zinoviev
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Alexey Zinoviev
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...Alexey Zinoviev
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAlexey Zinoviev
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Alexey Zinoviev
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsAlexey Zinoviev
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"Alexey Zinoviev
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Alexey Zinoviev
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиAlexey Zinoviev
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)Alexey Zinoviev
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!Alexey Zinoviev
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSAlexey Zinoviev
 

More from Alexey Zinoviev (20)

Kafka pours and Spark resolves
Kafka pours and Spark resolvesKafka pours and Spark resolves
Kafka pours and Spark resolves
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
 
Hadoop Jungle
Hadoop JungleHadoop Jungle
Hadoop Jungle
 
HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...HappyDev'15 Keynote: Когда все данные станут большими...
HappyDev'15 Keynote: Когда все данные станут большими...
 
Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15Мастер-класс по BigData Tools для HappyDev'15
Мастер-класс по BigData Tools для HappyDev'15
 
JavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projectsJavaDayKiev'15 Java in production for Data Mining Research projects
JavaDayKiev'15 Java in production for Data Mining Research projects
 
Joker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDBJoker'15 Java straitjackets for MongoDB
Joker'15 Java straitjackets for MongoDB
 
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
EST: Smart rate (Effective recommendation system for Taxi drivers based on th...
 
Android Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find youAndroid Geo Apps in Soviet Russia: Latitude and longitude find you
Android Geo Apps in Soviet Russia: Latitude and longitude find you
 
Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8Keynote on JavaDay Omsk 2014 about new features in Java 8
Keynote on JavaDay Omsk 2014 about new features in Java 8
 
Big data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphsBig data algorithms and data structures for large scale graphs
Big data algorithms and data structures for large scale graphs
 
"Говнокод-шоу"
"Говнокод-шоу""Говнокод-шоу"
"Говнокод-шоу"
 
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
Выбор NoSQL базы данных для вашего проекта: "Не в свои сани не садись"
 
Алгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерностиАлгоритмы и структуры данных BigData для графов большой размерности
Алгоритмы и структуры данных BigData для графов большой размерности
 
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)ALMADA 2013 (computer science school by Yandex and Microsoft Research)
ALMADA 2013 (computer science school by Yandex and Microsoft Research)
 
GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!GDG Devfest Omsk 2013. Year of events!
GDG Devfest Omsk 2013. Year of events!
 
How to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOSHow to port JavaScript library to Android and iOS
How to port JavaScript library to Android and iOS
 

Recently uploaded

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 

Recently uploaded (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 

Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...

  • 1. Speaker : Alexey Zinoviev Python's slippy path and Tao of thick Pandas: give my data, Rrrrr...
  • 2. About ● I am a <graph theory, machine learning, traffic jams prediction, BigData algorythms> scientist ● But I'm a <Java, Scala, Python, NoSQL, Hadoop, Spark> programmer
  • 3. One of these fine days...
  • 4. We need in R dev 'cause Data Mining
  • 5. You're a programmer, not an analyst
  • 7. Let’s talk about it, Python-boy...
  • 9. Hey, man, predict me something!
  • 11. ● Which loan applicants are high-risk? ● How do we detect phone card fraud? ● Which customers do prefer product A over product B? ● What is the revenue prediction for next year? Typical questions for DM
  • 12. What is Data Mining? 12/68
  • 17. 1. Selection 2. Pre-processing 3. Transformation 4. Data Mining 5. Interpretation/Evaluation Magic part of KDD (Knowledge Discovery in Databases)
  • 18. 1. Share your date with us 2. Our magic manipulations 3. Building an answering machine 4. PROFIT!!! How it really works
  • 20. ● Facebook users, tweets ● Weather ● Sea routes ● Trade transactions ● Goverment ● Medicine (genomic data) ● Telecommuncations (phone call records) Data examples
  • 21. ● Relational Databases ● Data warehouses (Historical data) ● Files in CSV or in binary format ● Internet or electronic mails ● Scientific, research (R, Octave, Matlab) Data sources
  • 22. Target Data & Personal Data 22/68
  • 23.
  • 24. ● All your personal data (PD) are being deeply mined ● The industry of collecting, aggregating, and brokering PD is “database marketing.” ● 1.1 billion browser cookies, 200 million mobile profiles, and an average of 1,500 pieces of data per consumer in Acxiom Pay with your personal data
  • 26. ● Select small pieces ● Define default values for missed data ● Remove strange signals from data ● Merge some tables in one if required
  • 29. It is the process of finding model of function that describes and distinguishes data class to predict the class of objects whose class label is unknown. What is Cluster Analysis?
  • 30. ● Statistical process for estimating the relationships among variables ● The estimation target is function (it can be probability distribution) ● Can be linear, polynomial, nonlinear and etc. Regression
  • 31. ● Training set of classified examples (supervised learning) ● Test set of non-classified items ● Main goal: find a function (classifier) that maps input data to a category ● Computer vision, drug discovery, speech recognition, biometric indentification, credit scoring Classification
  • 34. ● There are two classes of objects A & B ● Define the class of new object, based on information about its neighbors ● Changing the boundaries of an new object area, we form a set of neighbors. ● New object is B becuase majority of the neighbors is a B. kNN (k-nearest neighbor)
  • 36.
  • 37.
  • 39.
  • 40. ● It’s free ● Not full implemented stack of ML algorythms ● All your matrix are belong to us! ● Single thread model ● Hard way to integrate something Why not Octave?
  • 41.
  • 42. ● Old ancient community ● Syntax is too sweet ● You should read 1000 lines in docs to write 1 line of code ● Single thread model for 95% algorythms ● Not good for your production Why not R?
  • 43. ● Now Python is an idol for young scientists due to the low barrier to entry ● Python is a glue ● It’s solving two-language problem ● Have you ever heard about a Jython? ● Not so long way to real Highload production Why not Python?
  • 45. ● A fast and efficient multidimensional array object ndarray ● Tools for reading and writing array-based data sets to disk ● Linear algebra operations, Fourier transform ● Tools for integration C, C++, and Fortran Son of the Fortran : NumPy
  • 46. ● Sparse matrix, signal processing tools, optimization, differential equation ● Integration of old Fortran libraries ● Tool for using inline C++ code to accelerate array computations ● Powerful stat and discrete probability distributions SciPy as a servant
  • 47.
  • 48. ● The DataFrame, a twodimensional tabular, column-oriented data structure with both row and column labels ● It is easy to reshape, slice and dice, perform aggregations, and select subsets (EASY JOIN AND GROUP BY IN CODE!!!!) ● High-performance time series functionality and tools well- suited for working with financial data ● Read and write CSV, JSON, HTML, XML, HDF5 and special binary formats Pandas : what’s new?
  • 49. All we need in draw (Matplotlib)
  • 50. ● The DataFrame, a twodimensional tabular, column-oriented data structure with both row and column labels ● It is easy to reshape, slice and dice, perform aggregations, and select subsets (EASY JOIN AND GROUP BY IN CODE!!!!) ● High-performance time series functionality and tools well- suited for working with financial data ● Read and write CSV, JSON, HTML, XML, HDF5 and special binary formats Pandas : what’s new?
  • 51. ● Classification, regression, clustering, decision trees ● Comparing, validating and choosing parameters and models ● Feature extraction and normalization ● Encoding categorical features ● Dimensionality Reduction Scikit-learn
  • 52. How to start mining your data 52/68
  • 53.
  • 54.
  • 57. ● MapReduce in memory ● Up to 50x faster than Hadoop ● Support for Shark (like Hive), MLlib (Machine learning), GraphX (graph processing) ● RDD is a basic building block (immutable distributed collections of objects) Spark
  • 58. Spark
  • 59. ● Python is dynamically typed, so RDDs can hold objects of multiple types ● PySpark does not yet support a few API calls, such as lookup and non-text input files ● PySpark requires Python 2.6 or higher (not tested with P3) ● PySpark applications are executed using a standard CPython interpreter ● MLlib is supported, GraphX is not supported PySpark (Python API for Spark)
  • 61.
  • 62.
  • 63. ● Python language data structures for graphs, digraphs, and multigraphs ● Generators for classic graphs, random graphs, and synthetic networks ● Analyzing : clustering, independent set, closeness, betweeness, connectivity, transitivity, flows, isomorphism, Page Rank NetworkX
  • 64. Graph Number of vertexes Number of edges Volume Data/per day Web-graph 1,5 * 10^12 1,2 * 10^13 100 PB 300 TB Facebook (friends graph) 1,1 * 10^9 160 * 10^9 1 PB 15 TB Road graph of EU 18 * 10^6 42 * 10^6 20 GB 50 MB Road graph of this city 250 000 460 000 500 MB 100 KB
  • 65. ● Leaderboard ● How accurate does my model really need to be? What data can I use? (bussiness problem) ● Good clear data for unbelievable situations ● Fun & enjoyment building model ● Kaggle is too far from your production Machine learning isn't Kaggle competitions
  • 66. Size Classification Tools Lines Sample Data Analysis and Visualization Whiteboard, bash KBs - low MBs Prototype Data Analysis and Visualization Matlab, Octave, R MBs - low GBs Online Data Storage MySQL (DBs) MBs - low GBs Online Data Analysis NumPy, SciPy, Weka, BLAS/LAPACK GBs - TBs - PBs BigData Storage HDFS, HBase, Cassandra GBs - TBs - PBs Big Data Analysis Hive, Mahout, Hama, Giraph,MLlib
  • 67. Mailing lists, google groups: pydata, pystatsmodels, numpy- discussion, scipy-user Conferences: PyCon, EuroPython and its friends SciPy and EuroSciPy Books: Python for Data Analysis, Mastering Machine Learning with scikit-learn Community, conferences and books