HANDLING BIGGER DATA
What to do if your data’s too big
Data nerding
Your 5-7 things
❑ Bigger data
❑ Much bigger data
❑ Much bigger data storage
❑ Bigger data science teams
BIGGER DATA
Or, ‘data that’s a bit too big’
First, don’t panic
Computer storage
❑ 250 GB internal hard drive: (hopefully) permanent storage. The place you’re storing photos, data, etc.
❑ 16 GB RAM: temporary storage. The place read_csv loads your dataset into.
❑ 2 TB external hard drive: a handy place to keep bigger data files.
Gigabytes, Terabytes etc.
Name        Size in bytes                        Contains (roughly)
Byte        1                                    1 character (‘a’, ‘1’, etc.)
Kilobyte    1,000                                Half a printed page
Megabyte    1,000,000                            1 novella; 5 MB = complete works of Shakespeare
Gigabyte    1,000,000,000                        1 high-fidelity symphony recording; 10 m of shelved books
Terabyte    1,000,000,000,000                    All the X-ray films in a large hospital; 10 TB = Library of Congress collection; 2.6 TB = Panama Papers leak
Petabyte    1,000,000,000,000,000                2 PB = all US academic libraries; 10 PB = 1 hour’s output from the SKA telescope
Exabyte     1,000,000,000,000,000,000            5 EB = all words ever spoken by humans
Zettabyte   1,000,000,000,000,000,000,000
Yottabyte   1,000,000,000,000,000,000,000,000    Current storage capacity of the Internet
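The decimal prefixes in the table can be turned into a quick size check. This is a minimal sketch: `human_size` is a hypothetical helper name invented for illustration, not part of pandas or any library mentioned in these slides.

```python
# Decimal (SI) units, matching the table: each step is a factor of 1,000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_size(n_bytes):
    """Format a byte count using decimal units (hypothetical helper)."""
    size = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number below 1,000,
        # or at the largest unit we know about.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


print(human_size(16_000_000_000))      # the laptop RAM from the storage slide
print(human_size(2_600_000_000_000))   # the Panama Papers leak
```

Comparing a file's size against available RAM in this way is a rough first test of whether read_csv can load it in one go.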
Things to Try: Too Big
❑ Read data in ‘chunks’:
csv_chunks = pandas.read_csv('myfile.csv', chunksize=10000)
❑ Divide and conquer in your code:
csv_chunks = pandas.read_csv('myfile.csv', skiprows=range(1, 10001), chunksize=10000)  # skip the first 10,000 data rows, keep the header
❑ Use parallel processing
❑ E.g. the Dask library
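A sketch of what reading in chunks might look like end to end. The temporary file and the `value` column are made up so the example is self-contained; with a real dataset you would pass your own path and column names.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in CSV so the sketch is self-contained.
# The file contents and the "value" column are invented for illustration.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "myfile.csv")
pd.DataFrame({"value": range(25_000)}).to_csv(csv_path, index=False)

total = 0
row_count = 0
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk of 10,000 rows is held in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=10_000):
    total += int(chunk["value"].sum())
    row_count += len(chunk)

mean_value = total / row_count
print(row_count, mean_value)
```

The pattern is to keep only small running totals between chunks, which works for sums, counts and means but not directly for statistics such as the median.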
Things to try: Too Slow
❑ Use %timeit to find where the speed problems are
❑ Use compiled Python (e.g. the Numba library)
❑ Use C code (via Cython)
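%timeit is an IPython/Jupyter magic; outside a notebook, the standard-library `timeit` module does a similar job. The sketch below assumes two made-up functions that compute the same sum of squares, to show how a timing comparison could be set up.

```python
import timeit

N = 10_000  # arbitrary size chosen for illustration


def loop_sum_of_squares(n):
    """Plain Python loop: one interpreted step per element."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def formula_sum_of_squares(n):
    """Closed form for 0^2 + 1^2 + ... + (n-1)^2, with no loop at all."""
    return (n - 1) * n * (2 * n - 1) // 6


# Check that both versions agree before comparing their speed.
assert loop_sum_of_squares(N) == formula_sum_of_squares(N)

# timeit.timeit runs each callable `number` times and returns total seconds.
loop_time = timeit.timeit(lambda: loop_sum_of_squares(N), number=20)
formula_time = timeit.timeit(lambda: formula_sum_of_squares(N), number=20)
print(f"loop: {loop_time:.4f}s  formula: {formula_time:.6f}s")
```

Measuring first is the point: a better algorithm is often a bigger win than reaching for Numba or Cython.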
MUCH BIGGER DATA
Or, ‘What if it really doesn’t fit?’
Volume, Velocity, Variety
Much Faster Datastreams
Twitter firehose:
❑ Firehose averages 6,000 tweets per second
❑ Record is 143,199 tweets in one second (Aug 3rd 2013, Japan)
❑ Twitter public streams = 1% of Firehose stream
Google index (2013):
❑ 30 trillion unique pages on the internet
❑ Google index = 100 petabytes (100 million gigabytes)
❑ 100 billion web searches a month
❑ Search returned in about ⅛ second
Distributed systems
❑ Store the data on multiple ‘servers’:
❑ Big idea: Distributed file systems
❑ Replicate data (server hardware breaks more often than you think)
❑ Do the processing on multiple servers:
❑ Lots of code does the same thing to different pieces of data
❑ Big idea: Map/Reduce
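The Map/Reduce idea can be sketched on a single machine with plain Python. This is only an illustration of the pattern, not how Hadoop or Spark implement it; the sample strings and function names are made up.

```python
from collections import Counter
from functools import reduce

# Pretend each string lives on a different server (made-up sample data).
pieces = [
    "big data is big",
    "data moves slowly",
    "big clusters move data",
]


def map_piece(text):
    """Map step: count words in one piece, independently of the others."""
    return Counter(text.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Each map call needs only its own piece, so in a real cluster
# the calls could run on different machines at the same time.
partial_counts = map(map_piece, pieces)
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(2))
```

Because the map step never looks at another piece and the reduce step only merges small summaries, very little data has to move between servers.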
Parallel Processors
❑Laptop: 4 cores, 16 GB RAM, 256 GB disk
❑Workstation: 24 cores, 1 TB RAM
❑Clusters: as big as you can imagine…
Distributed filesystems
Your typical rack server...
Map/Reduce: Crowdsourcing for computers
Distributed Programming Platforms
Hadoop
❑ HDFS: distributed filesystem
❑ MapReduce engine: processing
Spark
❑ In-memory processing
❑ Because moving data around is the biggest bottleneck
Typical (Current) Ecosystem
HDFS
Spark
Python
R
SQL
Tableau
Publisher
Data warehouse
Anaconda comes with this…
Parallel Python Libraries
❑ Dask
❑ Datasets look like NumPy arrays, pandas DataFrames
❑ df.groupby(df.index).value.mean()
❑ Direct access into HDFS, S3 etc
❑ PySpark
❑ Also has DataFrames
❑ Connects to Spark
MUCH BIGGER DATA STORAGE
Or, ‘Where do we put all this stuff?’
SQL Databases
❑ Row/column tables
❑ Keys
❑ SQL query language
❑ Joins etc (like Pandas)
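A small sketch of tables, keys and a join, using Python's built-in `sqlite3` module with an in-memory database. The `users` and `visits` tables and their contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (
        visit_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        page TEXT
    );
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO visits VALUES (1, 1, '/home'), (2, 1, '/docs'), (3, 2, '/home');
    """
)

# Join the two tables on their shared key, then count visits per user.
rows = conn.execute(
    """
    SELECT u.name, COUNT(*) AS n_visits
    FROM users AS u
    JOIN visits AS v ON v.user_id = u.user_id
    GROUP BY u.name
    ORDER BY u.name
    """
).fetchall()
print(rows)
conn.close()
```

The JOIN ... ON clause plays roughly the role of `merge` in pandas, and GROUP BY the role of `groupby`.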
ETL (Extract - Transform - Load)
❑ Extract
❑ Extract data from multiple sources
❑ Transform
❑ Convert data into database formats (e.g. SQL)
❑ Load
❑ Load data into database
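The three ETL steps can be sketched with the standard library alone. The two in-memory "sources", their column names and the `payments` table are assumptions made up for this example; real pipelines usually pull from files, APIs or other databases.

```python
import csv
import io
import sqlite3

# Extract: two made-up sources with slightly different layouts.
source_a = io.StringIO("name,amount\nalice,10.50\nbob,3\n")
source_b = io.StringIO("NAME;AMOUNT\nCarol;7.25\n")

raw_rows = list(csv.DictReader(source_a))
raw_rows += [
    {key.lower(): value for key, value in row.items()}
    for row in csv.DictReader(source_b, delimiter=";")
]

# Transform: normalise names and convert amounts to a numeric type.
clean_rows = [
    (row["name"].strip().title(), float(row["amount"])) for row in raw_rows
]

# Load: write the cleaned rows into a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (name TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", clean_rows)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(clean_rows, total)
conn.close()
```

Keeping the three stages separate makes it easier to add a new source later without touching the transform or load code.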
Data warehouses
NoSQL Databases
❑ Not forced into row/column
❑ Lots of different types
❑ Key/value: can add features without rewriting tables
❑ Graph: stores nodes and edges
❑ Column: useful if you have a lot more reads than writes
❑ Document: general-purpose. MongoDB is commonly used.
Data Lakes
BIGGER DATA SCIENCE TEAMS
Or, ‘Who does this stuff?’
Big Data Work
❑ Data Science
❑ Data Analysis
❑ Data Engineering
❑ Data Strategy
Big Data Science Teams
❑ Usually seen:
❑ Project manager
❑ Business analysts
❑ Data Scientists / Analysts: insight from data
❑ Data Engineers / Developers: data flow implementation, production systems
❑ Sometimes seen:
❑ Data Architect: data flow design
❑ User Experience / User Interface developer / Visual designer
Data Strategy
❑ Why should data be important here?
❑ Which business questions does this place have?
❑ What data does/could this place have access to?
❑ How much data work is already here?
❑ Who has the data science gene?
❑ What needs to change to make this place data-driven?
❑ People (training, culture)
❑ Processes
❑ Technologies (data access, storage, analysis tools)
❑ Data
Data Analysis
❑ What are the statistics of this dataset?
❑ E.g. which pages are popular
❑ Usually on already-formatted data, e.g. Google Analytics results
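For the "which pages are popular" question, a first pass can be as simple as counting. A minimal sketch, assuming a made-up list of page requests rather than a real analytics export:

```python
from collections import Counter

# Made-up page-view log: one entry per page request.
page_views = ["/home", "/docs", "/home", "/blog", "/home", "/docs"]

# Count how often each page appears, then take the two most viewed.
view_counts = Counter(page_views)
top_pages = view_counts.most_common(2)
print(top_pages)
```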
Data Science
❑ Ask an interesting question
❑ Get the data
❑ Explore the data
❑ Model the data
❑ Communicate and visualize your results
Data Engineering
❑ Big data storage
❑ SQL, NoSQL
❑ warehouses, lakes
❑ Cloud computing architectures
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ Big data analytics
❑ Distributed programming platforms
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ etc.
EXERCISES
Or, ‘Trying some of this out’
Exercises
❑ Use pandas read_csv() to read a data file in chunks
LEARNING MORE
Or, ‘books’
READING
“Books are a uniquely portable magic” – Stephen King
THANK YOU
sjterp@thoughtworks.com

Session 10 handling bigger data
