When Big Data Meet Python

                             Jimmy Lai (賴弘哲)
                           jimmy.lai@oi-sys.com
                                2012/08/19
Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python


                          2012
 When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
About Me
• Jimmy Lai (賴弘哲)
• Interests: Data mining, Machine Learning,
  Natural Language Processing, Distributed
  Computing, Python
• LinkedIn profile: http://goo.gl/XTEM5
• Currently at Oxygen Intelligence Taiwan Limited
  (引京聚點知識結構搜索公司), working on
  semantic analysis of big data

Outline
1. Big Data
  a. Concept
  b. Technical issues
2. Big Data + Python
  a. Related open source tools
  b. Example




Benefits of Big Data
1. Creating transparency, e.g. http://www.data.gov/
2. Enabling experimentation to discover needs,
   expose variability, and improve performance
3. Segmenting populations to customize actions
4. Replacing/supporting human decision making
   with automated algorithms
5. Innovating new business models, products and
   services
Shortage of deep data-analysis talent.
Source: McKinsey Global Institute (May 2011). Big Data: The next frontier for
innovation, competition, and productivity.

Initiative from the White House
• (Mar 2012) Big Data Research and
  Development Initiative, the White House.
• National Science Foundation encourages
  education on Big Data.
• The government invests in developing state-of-the-art
  technologies, harnessing those technologies, and
  expanding the workforce for Big Data.

Big Data Issues
Data sources: User Generated Content, Machine Generated Data
Pipeline stages:
1. Collecting
2. Storage
3. Computing
4. Analysis
5. Visualization

Big Data Techniques
Stage: Collecting
• Crawler
  – Collect raw data
  – E.g. Heritrix, Nutch
• Scraping
  – Parse information from raw data
  – E.g. Yahoo! Pipes, Scrapy

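The scraping step on this slide (parse information from raw data) can be sketched without any framework. The example below is only an illustration using Python's standard-library `html.parser`; the sample HTML, the `TitleLinkParser` class, and the extracted fields are assumptions made for the example and do not come from the talk. A framework such as Scrapy wraps scheduling, downloading, and item pipelines around the same idea.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Hypothetical parser: collect the page title and all link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def scrape(raw_html):
    """Turn raw HTML into a small structured item."""
    parser = TitleLinkParser()
    parser.feed(raw_html)
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical raw page, standing in for what a crawler would download.
RAW_PAGE = """
<html><head><title> Food blog </title></head>
<body><a href="/post/1">Ramen</a> <a href="/post/2">Dumplings</a></body></html>
"""

item = scrape(RAW_PAGE)
print(item)
```
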
Big Data Techniques
Stage: Storage
• Big Table
  – Distributed key-value storage
  – E.g. HBase, Cassandra
• NoSQL
  – Does not use SQL for manipulation
  – Does not use the relational database model
  – E.g. MongoDB, Redis, CouchDB

Big Data Techniques
Stage: Computing
• Batch
  – MapReduce
  – E.g. Hadoop
• Real-time
  – Stream processing
  – E.g. S4, Storm

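To make the batch model concrete, here is a minimal single-process sketch of the MapReduce idea: a map phase emits key/value pairs, a shuffle groups values by key, and a reduce phase folds each group. The `map_reduce`, `word_mapper`, and `count_reducer` names and the sample lines are hypothetical; a real system such as Hadoop distributes these phases across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


# Hypothetical input records.
LINES = ["big data meet python", "python loves data"]
word_counts = map_reduce(LINES, word_mapper, count_reducer)
print(word_counts)
```
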
Big Data Techniques
Stage: Analysis
• Data mining
  – Weka
• Machine learning
  – scikit-learn
• Natural language processing
  – NLTK, Stanford NLP
• Statistics
  – R

Big Data Techniques
Stage: Visualization
• Abstract
• Interactive
• E.g. Processing, Gephi, D3.js

Why Python?
• Good code readability for fast development.
• Scripting language: the less code, the more productivity.
• Fast growing among open source communities.
  – Commit statistics from ohloh.net

When Big Data meet Python
Data sources: User Generated Content, Machine Generated Data
Diagram label: Infrastructure
Python tools for each stage:
• Collecting
  – Scrapy: scraping framework
• Storage
  – PyMongo: Python client for MongoDB
• Computing
  – Hadoop streaming: Linux pipe interface
  – Disco: lightweight MapReduce in Python
• Analysis
  – Pandas: data analysis/manipulation
  – Statsmodels: statistics
  – NLTK: natural language processing
  – Scikit-learn: machine learning
• Visualization
  – Matplotlib: plotting
  – NetworkX: graph visualization

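The "Linux pipe interface" of Hadoop streaming means that the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with the framework sorting mapper output by key in between. The sketch below imitates that contract in one process so it can be run directly; the log format, the function names, and the `simulate` helper are assumptions for illustration only.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'status<TAB>1' for each log line (status is the last field)."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Hypothetical stand-in for the framework: map, sort by key, reduce."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical machine-generated log lines.
LOG = ["GET /a 200", "GET /b 404", "GET /c 200"]
result = simulate(LOG)
print(result)
```

In an actual streaming job each function would live in its own script that iterates over `sys.stdin` and prints its output lines; the same pair can usually be checked locally by piping a sample file through the mapper, `sort`, and the reducer.
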
When Big Data meet Python
Stage: Collecting
Scrapy (http://scrapy.org/): web scraping framework
• Simple and extensible
• Components:
  • Scheduler
  • Downloader
  • Spider (Scraper)
  • Item pipeline

When Big Data meet Python
Stage: Storage
MongoDB (http://www.mongodb.org/): NoSQL database
• PyMongo: client for Python
• Document (JSON)-oriented
• No schema
• Scalable
  • Auto-sharding
  • Replica set
• File storage
• MapReduce aggregation

When Big Data meet Python
Stage: Computing
Disco (http://discoproject.org/)
• Distributed computing:
  – MapReduce
  – Disco distributed file system
• Write code in Python
  – Easy and fast to profile
  – Easy and fast to debug

When Big Data meet Python
Stage: Analysis
Pandas (http://pandas.pydata.org/)
• Data analysis library
• Data structures for fast data manipulation
  – Slicing
  – Indexing
  – Subsetting
• Handling missing data
• Aggregation
• Time series

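The operations listed on this slide can be illustrated on a tiny table. This is a sketch under stated assumptions: the shop names, scores, and dates are made-up data, and the snippet needs the `pandas` package installed.

```python
import pandas as pd

# Hypothetical review scores indexed by date; None marks a missing rating.
df = pd.DataFrame(
    {
        "shop": ["A", "A", "B", "B", "C"],
        "score": [4.0, None, 3.0, 5.0, 2.0],
    },
    index=pd.date_range("2012-08-01", periods=5, freq="D"),
)

first_three = df.iloc[:3]                                # slicing by position
aug_third = df.loc[pd.Timestamp("2012-08-03"), "score"]  # label-based indexing
high = df[df["score"] >= 4.0]                            # boolean subsetting
filled = df["score"].fillna(df["score"].mean())          # fill missing values
mean_by_shop = df.groupby("shop")["score"].mean()        # aggregation, NaN skipped

print(mean_by_shop)
```
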
When Big Data meet Python
Stage: Analysis
Statsmodels (http://statsmodels.sourceforge.net/)
• Statistical analysis
  • Statistical models
  • Fit data with model
  • Statistical tests
  • Data exploration
  • Time series analysis

When Big Data meet Python
Stage: Analysis
scikit-learn (http://scikit-learn.org/)
• Machine learning algorithms
• Supervised learning
• Unsupervised learning
• Dataset
  • Preprocessing
  • Feature extraction
• Model
  • Selection
  • Pipeline

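Feature extraction, a supervised model, and a pipeline fit together roughly as follows. This is a minimal sketch, not the setup used in the talk: the toy review snippets and labels are invented, and the choice of `CountVectorizer` with `MultinomialNB` is an assumption made to keep the example small. It needs the `scikit-learn` package installed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set of labelled review snippets.
train_texts = [
    "good tasty delicious",
    "great tasty food",
    "bad awful service",
    "terrible bad food",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = Pipeline(
    [
        ("features", CountVectorizer()),   # feature extraction: bag of words
        ("classifier", MultinomialNB()),   # supervised learning model
    ]
)
model.fit(train_texts, train_labels)

# Classify two unseen snippets with the fitted pipeline.
predictions = model.predict(["tasty delicious", "awful terrible"]).tolist()
print(predictions)
```

Wrapping both steps in one `Pipeline` keeps the vocabulary learned during `fit` attached to the classifier, so raw text can be passed straight to `predict`.
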
When Big Data meet Python
Stage: Analysis
NLTK: Natural Language Toolkit (http://nltk.org/)
• Natural language processing
• Annotated corpora and resources
• Information extraction workflow:
  1. Sentence segmentation
  2. Tokenization
  3. POS tagging
  4. Named entity recognition
  5. Relation recognition

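The first two steps of the information extraction workflow can be sketched with plain regular expressions. This is deliberately not NLTK code: it is a naive standard-library stand-in, with hypothetical function names and sample text, meant only to show what each step produces. NLTK's own tokenizers handle abbreviations and other cases that this sketch gets wrong, and the later steps (POS tagging, entity and relation recognition) need trained models.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical review text.
TEXT = "The ramen was great. Service was slow!"
sentences = split_sentences(TEXT)
tokens = [tokenize(sentence) for sentence in sentences]
print(tokens)
```
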
When Big Data meet Python
Stage: Visualization
Matplotlib (http://matplotlib.sourceforge.net/)
• Plotting
  – Histograms
  – Power spectra
  – Bar charts
  – Error charts
  – Scatter plots
• Full control over plotting details

When Big Data meet Python
Stage: Visualization
NetworkX (http://networkx.lanl.gov/)
• Graph algorithms and visualization
• Draw graph with layout:
  – Circular
  – Random
  – Spectral
  – Spring
  – Shell
  – Graphviz

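A layout in NetworkX is a mapping from each node to a 2-D position, which a drawing call then uses. The sketch below computes five of the layouts named on this slide for a small made-up graph; the graph and the seed values are assumptions, and the snippet needs the `networkx` package (with NumPy) installed. The Graphviz layout is left out because it depends on an external Graphviz binding.

```python
import networkx as nx

# Small hypothetical graph: a cycle of six nodes.
graph = nx.cycle_graph(6)

layouts = {
    "circular": nx.circular_layout(graph),
    "random": nx.random_layout(graph, seed=42),
    "spectral": nx.spectral_layout(graph),
    "spring": nx.spring_layout(graph, seed=42),
    "shell": nx.shell_layout(graph),
}

# Each layout maps every node to an (x, y) position.
for name, positions in layouts.items():
    print(name, len(positions))
```

Any of these dictionaries can be passed as the `pos` argument of `nx.draw` to render the graph with Matplotlib.
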
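聚寶評 (www.ezpao.com)
• Food search engine
• Searches food reviews from major blogs

聚寶評 (www.ezpao.com)
• Semantic analysis search engine
• Review topic analysis
• Analysis of dishes shared by users
• Positive/negative review analysis
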
Thank you for your attention.
           Q&A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer

Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• About the company: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com

More Related Content

What's hot

Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseJihoon Son
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysCAPSiDE
 
An introduction to U1db
An introduction to U1dbAn introduction to U1db
An introduction to U1dbDavid Planella
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Bubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsBubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsStefan Urbanek
 
Design of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureDesign of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureLuiz Henrique Zambom Santana
 
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveAkamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveRick Viscomi
 
Tracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP ArchiveTracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP ArchiveRick Viscomi
 
Data Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioData Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioWinston Chen
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Roy Russo
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo dbMongoDB
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandasmaikroeder
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshopMathieu Elie
 

What's hot (20)

MongoDB and Python
MongoDB and PythonMongoDB and Python
MongoDB and Python
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
 
An introduction to U1db
An introduction to U1dbAn introduction to U1db
An introduction to U1db
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Bubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsBubbles – Virtual Data Objects
Bubbles – Virtual Data Objects
 
Design of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureDesign of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore Architecture
 
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveAkamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
 
Tracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP ArchiveTracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP Archive
 
Python and MongoDB
Python and MongoDB Python and MongoDB
Python and MongoDB
 
MongoDB
MongoDBMongoDB
MongoDB
 
Data Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioData Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudio
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo db
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
MongoDB
MongoDBMongoDB
MongoDB
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshop
 

Viewers also liked

Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source toolsSammy Fung
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...Anton
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr UsageJimmy Lai
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookJimmy Lai
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst NanodegreeJimmy Lai
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast PrototypingJimmy Lai
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in pythonJimmy Lai
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHugJimmy Lai
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge BaseJimmy Lai
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugJimmy Lai
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 

Viewers also liked (19)

Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source tools
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
 
摘星
摘星摘星
摘星
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst Nanodegree
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge Base
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHug
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 

Similar to When big data meet python @ COSCUP 2012

Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranJAX London
 
From open data to API-driven business
From open data to API-driven businessFrom open data to API-driven business
From open data to API-driven businessOpenDataSoft
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesVishy Poosala
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012MongoDB
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
Rails with MongoDB
Rails with MongoDBRails with MongoDB
Rails with MongoDBEugene Park
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...i_scienceEU
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 
Data mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsData mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsGDi Techno Solutions
 
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMETHE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMEGigaom
 
Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...
Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...
Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...Ohud Saud
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server ProLynn Langit
 

When big data meet python @ COSCUP 2012

  • 1. When Big Data Meet Python. Jimmy Lai (賴弘哲), jimmy.lai@oi-sys.com, 2012/08/19. Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python. "When big data meet python" by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  • 2. About me. Jimmy Lai (賴弘哲). Interests: data mining, machine learning, natural language processing, distributed computing, Python. LinkedIn profile: http://goo.gl/XTEM5. Currently working at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索公司) on semantic analysis of big data.
  • 3. Outline. 1. Big Data: (a) Concept; (b) Technical issues. 2. Big Data + Python: (a) Related open source tools; (b) Example.
  • 4. Benefits of Big Data. 1. Creating transparency, e.g. http://www.data.gov/. 2. Enabling experimentation to discover needs, expose variability, and improve performance (discovering needs and potential threats, improving productivity). 3. Segmenting populations to customize actions. 4. Replacing or supporting human decision making with automated algorithms. 5. Innovating new business models, products, and services. Challenge: a shortage of people with deep data-analysis skills. Source: McKinsey Global Institute (May 2011), Big Data: The next frontier for innovation, competition, and productivity.
  • 5. Initiative from the White House. (Mar 2012) Big Data Research and Development Initiative, the White House. The National Science Foundation encourages education on Big Data. The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.
  • 6. Big Data Issues. Data: User Generated Content, Machine Generated Data. Stages: Collecting, Storage, Computing, Analysis, Visualization.
  • 7. Big Data Techniques: Collecting. Crawler: collect raw data (e.g. Heritrix, Nutch). Scraping: parse information from raw data (e.g. Yahoo! Pipes, Scrapy).
  • 8. Big Data Techniques: Storage. Big Table: distributed key-value storage (e.g. HBase, Cassandra). NoSQL: does not use SQL for manipulation and does not use the relational database model (e.g. MongoDB, Redis, CouchDB).
  • 9. Big Data Techniques: Computing. Batch: MapReduce (e.g. Hadoop). Real-time: stream processing (e.g. S4, Storm).
  • 10. Big Data Techniques: Analysis. Data mining: Weka. Machine learning: scikit-learn. Natural language processing: NLTK, Stanford NLP. Statistics: R.
  • 11. Big Data Techniques: Visualization. Abstract. Interactive. E.g. Processing, Gephi, D3.js.
  • 12. Why Python? Good code readability for fast development. Scripting language: the less code, the more productivity. Fast growing among open source communities (commit statistics from ohloh.net).
  • 13. When Big Data meet Python. Collecting: Scrapy (scraping framework). Infrastructure: Storage with PyMongo (Python client for MongoDB); Computing with Hadoop streaming (Linux pipe interface) and Disco (lightweight MapReduce in Python). Analysis: Pandas (data analysis/manipulation), Statsmodels (statistics), NLTK (natural language processing), Scikit-learn (machine learning). Visualization: Matplotlib (plotting), NetworkX (graph visualization).
  • 14. When Big Data meet Python: Scrapy (http://scrapy.org/). Web scraping framework. Simple and extensible. Components: Scheduler, Downloader, Spider (Scraper), Item pipeline.
  • 15. When Big Data meet Python: MongoDB (http://www.mongodb.org/). NoSQL database. PyMongo: client for Python. Document (JSON)-oriented. No schema. Scalable: auto-sharding, replica sets. File storage. MapReduce aggregation.
  • 16. When Big Data meet Python: Disco (http://discoproject.org/). Distributed computing: MapReduce, Disco distributed file system. Write code in Python: easy and fast to profile, easy and fast to debug.
  • 17. When Big Data meet Python: pandas (http://pandas.pydata.org/). Data analysis library. Data structures for fast data manipulation: slicing, indexing, subsetting. Handling missing data. Aggregation. Time series.
  • 18. When Big Data meet Python: Statsmodels (http://statsmodels.sourceforge.net/). Statistical analysis. Statistical models. Fit data with a model. Statistical tests. Data exploration. Time series analysis.
  • 19. When Big Data meet Python: scikit-learn (http://scikit-learn.org/). Machine learning algorithms. Supervised learning. Unsupervised learning. Dataset. Preprocessing. Feature extraction. Model. Selection. Pipeline.
  • 20. When Big Data meet Python: NLTK, the Natural Language Toolkit (http://www.nltk.org/). Natural language processing. Annotated corpora and resources. Information extraction workflow: Sentence Segmentation → Tokenization → POS tagging → Named Entity Recognition → Relation Recognition.
  • 21. When Big Data meet Python: Matplotlib (http://matplotlib.sourceforge.net/). Plotting: histograms, power spectra, bar charts, error charts, scatter plots. Full control over the details of a plot.
  • 22. When Big Data meet Python: NetworkX (http://networkx.lanl.gov/). Graph algorithms and visualization. Draw graphs with layouts: circular, random, spectral, spring, shell, Graphviz.
  • 23. 聚寶評 (www.ezpao.com): a food search engine that searches food reviews from major blogs.
  • 24. 聚寶評 (www.ezpao.com): a search engine based on semantic analysis.
  • 25. Review topic analysis. Analysis of dishes shared by users. Positive/negative review analysis.
  • 26. Thank you for your attention. Q&A. We are hiring! Core engine algorithm R&D engineer. System R&D engineer. Web application R&D engineer. Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索股份有限公司). Company profile: http://www.ezpao.com/about/. Job openings: http://www.ezpao.com/join/. Please send your résumé to jimmy.lai@oi-sys.com.
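
The slides name MapReduce as the batch computing model (slide 9) and describe Hadoop streaming as a Linux pipe interface (slide 13), but show no code. The following is a minimal single-process sketch of that pattern, not code from the talk: the word-count task, the function names `mapper`, `reducer`, and `word_count`, and the sample input are all assumptions chosen for illustration. It uses only the Python standard library; in a real streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, and the framework would do the sorting between them.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce step: sum the counts of consecutive records that share a key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(int(count) for _, count in group)


def word_count(lines):
    # sorted() stands in for the shuffle/sort phase that a MapReduce
    # framework performs between the map and reduce steps (assumption:
    # everything fits in memory on one machine).
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data meet python", "python meet big data tools"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Keeping the map and reduce steps as plain generators over lines of text mirrors the pipe-style contract, so the same logic can be tested locally before it is run on a cluster.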