SlideShare a Scribd company logo
1 of 46
Download to read offline
Algorithmic Techniques for Big
Data Analysis
Barna Saha
AT&T Lab-Research
Need to ensure quality of data
Challenges of Big Data
•VOLUME
—Large amount of data
•VELOCITY
—Needs to be analyzed quickly
•VARIETY
—Different types of structured and
unstructured data
•VERACITY
—Low quality data, inconsistencies
This course
• Develop algorithms to deal with such data
– Emphasis on different models for processing data
– Common techniques and fundamental paradigm
• Style: algorithmic/theoretical
– Background in basic algorithms and probability
• However
– Important problems and practical solutions (mostly)
– Applied projects are OK and encouraged (more discussions to follow)
Grading
• Scribing once: 20%
• Participation (class and blog): 20%
– http://csci8980bigdataalgo.wordpress.com/: all information, lecture notes will be posted
there
– Add references
– Add datasets
– Write posts on interesting articles
– Discussquestions from the lecture
• Project: 60%
– Survey: 30%
• Read ~5 papers (~2 for background+~3 recent developments)
• Out of class presentation/discussion
– Project presentation+ Write-up: 30%
• Goal to have a project worthy of publication in a good conference in theory/ data bases/ data
mining.
• NO EXAMS
Timeline
• Scribe notes
– Due before next class. Template will be provided.
• Oct 3rd : submit names of 5 survey papers
– Encouraged to consult with me before
• Oct 17th: project proposal due
– Max2 pages + references
– Must be in latex, single column, article 11 pt.
• Nov 14th : a one page write-up describing progress on project due
• Dec 5th: project final write-up due
– At most 10 pages in latex + references+ appendix (if required to include
omitted proofs/experiments)
– Introduction with motivation + related works+ technical part +
experimental section
Office Hours
• By appointment: Friday 1pm to 5pm
– Send email to fix a time
• Email: barna@research.att.com, barna.cs@gmail.com
• Where: 6-198 KHKH
• Encouraged to discuss the progress on projects with me throughout
the semester
Tentative Syllabus
• Models
– Small memory algorithms
– External memory algorithms
– Distributed algorithms (Map Reduce)
– Crowdsourcing (People assisted computing)
• Focus: Efficiency vs Quality
– Near linear time algorithm design
– Incremental and update efficient algorithm design
– Sub-linear algorithm design / property testing
– Sparse transformation
– Dimensionality reduction
– Metric embedding
Tentative Course Plan
• Sept 5th Overview, Introduction to Data Streaming
• Sept 12th
• Sept 19th
• Sept 26th
• Oct 3rd Semi-streaming and External Memory Algorithms
• Oct 10th
• Oct 17th
• Oct 24th Property testing
• Oct 31st Sparse transformation or near-linear time algorithm design
• Nov 7th Metric embedding
• Nov 14th Crowdsourcing
• Nov 21st Project Presentation
• Nov 28th Thanksgiving break
• Dec 5th Project Presentation
Streaming: Sketching + Sampling, Dimensionality Reduction
Map Reduce Algorithms
Plan for this lecture
• 1st half: Overview of the course topics
• 2nd half: Introduction to Data Streaming
Models
• Different models need different algorithms for the same problem
– Default: Main Memory Model
– External Memory Model
– Streaming Model
– MapReduce
– Crowdsourcing
1. Do you have enough main memory ?
2. How much disk I/O are you performing ?
3. Is your data changing fast ?
4. Can you distribute your data to multiple
servers for fast processing ?
5. Is your data ambiguous that it needs
human power to process ?
Counting Distinct Elements
Given a sequence A= a1, a2, …, am where ai ∊ {1…n}, compute
the number of distinct elements in A (denoted by |A|).
• Natural and popular statistics, eg.
– Given the list of transactions, compute the number of different
customers (i.e. credit card numbers)
– What is the size of the web vocabulary ?
Example: 4 5 5 1 7 6 1 2 4 4 4 3 6 6
distinct elements=7
Counting Distinct Elements
• Default model: Random Access Main Memory Model
• Maintain an array of size n: B*1,…,n+—initially set to all “0”
• If item “i” arrives set B*i]=1
• Count the number of “1”s in B
Counting Distinct Elements
• Default model: Random Access Memory Model
• Maintain an array of size n: B*1,…,n+—initially set to all “0”
• If item “i” arrives set B*i]=1
• Count the number of “1”s in B
 O(m) running time
 Requires random access to B ( ? )
 Requires space n even though the
number of distinct elements is small
or m < n –domain may be much larger
Counting Distinct Elements
• Default model: Random Access Memory Model
• Initialize count=0, an array of lists B*1….O(m)+ and a hash function h
: ,1….n-  ,1…O(m)-
• For each ai
– Compute j=h(ai)
– Check if ai occurs in the list pointed to by B[j]
– If not, count=count+1 and add ai to the list
• Return count
Assuming that h(.) is random enough, running time is O(m), space usage O(m).
PROVE IT !
Space is still O(m)
Random access to B for each input
Counting Distinct Elements
• External Memory Model
– M units of main memory
– Input size m, m >> M
– Data is stored on disk:
• Space divided into blocks, each of size B <=M
• Transferring one block of data into the main memory takes unit
time
– Main memory operations for free but disk I/O is costly
– Goal is to reduce number of disk I/O
Distinct Elements in External Memory
• Sorting in external memory
• External Merge sort
– Split the data into M/B segments
– Recursively sort each segment
– Merge the segments using m/B block accesses
Example: M/B=3
No of disk I/O/merging= m/B
No of recursion call=logM/B m
Total sorting time=m/B logM/B m
Distinct Elements in External Memory
• Sorting in external memory
• External Merge sort
– Split the data into M/B segments
– Recursively sort each segment
– Merge the segments using m/B block accesses
Example: M/B=3
No of disk I/O/merging= m/B
No of recursion call=logM/B m
Total sorting time=m/B logM/B m
4 5 5 1 7 6 1 2 4 4 4 3 6 6
1 1 2 3 4 4 4 4 5 5 6 6 6 7
Count=1
For j=2,…,m
If aj > aj-1 count=count+1
Distinct Elements in Streaming Model
• Streaming Model
– Data comes in streaming fashion one at a time
(suppose from CD-ROM or cash-register)
– M units of main memory, M << m
– Only one pass over data
• Data not stored is lost
Distinct Elements in Streaming Model
• Suppose you want to know if the number of distinct elements is at
least “t”
• Initialize a hash function h:,1,…n-  ,1,…,t-
• Initialize the answer to NO
• For each ai:
– If h(ai) ==1, then set the answer to YES
The algorithm uses only 1 bit of storage ! (not counting the random bits for h)
Distinct Elements in Streaming Model
• Suppose you want to know if the number of distinct elements is at least
“t”
• Initialize a hash function h:,1,…m-  ,1,…,t-
• Initialize the answer to NO, count=0
• For each ai:
– If h(ai) ==1, then count++ (this run returns YES)
• Repeat the above procedure for log n different hash functions from the
family
– Set YES if count > log n (1-1/e) [Boosting the confidence]
The algorithm uses log n bit of storage ! (not countingthe random bits for h)
Run log(n) algorithmsin parallel using t=2,4,8,…n
Approximateanswerswith high probability > 1-1/n
Space usage O(log 2n)
Distinct Elements in Streaming Model
• Suppose you want to know if the number of distinct elements is at least
“t”
• Initialize a hash function h:,1,…m-  ,1,…,t-
• Initialize the answer to NO, count=0
• For each ai:
– If h(ai) ==1, then count++ (this run returns YES)
• Repeat the above procedure for log n different hash functions from the
family
– Set YES if count > log n (1-1/e) [Boosting the confidence]
The algorithm uses log n bit of storage ! (not countingthe random bits for h)
Run log(n) algorithmsin parallel using t=2,4,8,…n
Approximateanswerswith high probability > 1-1/n
Space usage O(log 2n)
Approximation and
Randomization are essential !
MapReduce Model
• Hardware is relatively cheap
• Plenty of parallel algorithms designed but
– Parallel programming is hard
• Threaded programs are difficult to test, debug, synchronization
issues, more machines mean more breakdown
• MapReduce makes parallel programming easy
MapReduce Model
• MapReduce makes parallel programming easy
– Tracks the jobs and restarts if needed
– Takes care of data distribution and synchronization
• But there is no free lunch:
– Imposes a structure on the data
– Only allows for certain kind of parallelism
MapReduce Model
• Data:
– Represented as <Key, Value> pairs
• Map:
– Data  List < Key, Value> [programmer specified]
• Shuffle:
– Aggregate all pairs with the same key [handled by system]
• Reduce:
– <Key, List(Value)> <Key, List(Value)> [programmer specified]
Distinct Elements in MapReduce
• r servers
• Data
– [1,a1], [2,a2+,…., *n,am]
• Map
– [1,a1], [2,a2+,…., *n,am][1,a1], [1,a2+,…*1,am/r] ,[2,am/r+1+,…., *2,a2m/r+,….,*r,am]
• Reduce
– Reducer 1: [1,a1], [1,a2+,…*1,am/r]  [1,a1], [1,a2+,…*1,am/r] ,[1,h()] generates the hash
function)
– Reducer 2: [2,am/r+1], [2,am/r+2+,…*2,a2m/r] [2,am/r+1], [2,am/r+2+,…*2,a2m/r]
– …
• Map
[1,a1], [1,a2+,…*1,am/r] ,[2,am/r+1+,…., *2,a2m/r+,….,*r,am],[1,h()] [1,a1], [1,a2+,…*1,am/r]
,[2,am/r+1+,…., *2,a2m/r+,….,*r,am+, *1,h()+, *2,h()+, …., *r,h()] makes multiple copies of the
hash function for distribution
• Reduce
– Reducer 1: [1,a1], [1,a2+,…*1,aN/r] ,[1,h()], create sketch B1, outputs [1,B1]
– ……
• Map
– [1,B1], [2,B2+,….,*r,Br] [1,B1], [1,B2+,….,*1,Br] gathers all the sketches
• Reduce
– Reducer1: [1,B1], [1,B2+,….,*1,Br], computes B= B1+B2+…..+Br, , Follows the Streaming
Algorithm to compute distinct elements from the sketch
Crowdsourcing
• Incorporating human power for data gathering and computing
• People still outperform state-of-the-art algorithms for many data
intensive tasks
– Typically involve ambiguity, deep understanding of language or context
or subjective reasoning
Distinct Elements by Crowdsourcing
• Ask for each pair if they are equal
• Create a graph with each element as node
• Add an edge between two nodes if the corresponding pairs are
returned to be equal
• Return number of connected components
• Also known as record linkage, entity resolution, deduplication
Distinct Elements by Crowdsourcing
Distinct
elements=4
Distinct Elements by Crowdsourcing
Distinct
elements=4
Too many questions to crowd ! Costly.
Can we reduce the number of questions ?
Scalable Algorithm Design
Is linear time good enough ?
– Don’t even read the entire data and return
result !!
– Hope: do not require exact answer
Near linear time algorithm
design
Seemingly best one can do
since reading data needs
linear time
Scalable Algorithm Design
Is linear time good enough ?
– Don’t even read the entire data and return
result !!
– Hope: do not require exact answer
Near linear time algorithm
design
Seemingly best one can do
since reading data needs
linear time
Property Testing
“In the ballpark” vs. “out of the ballpark”
tests
• Distinguish inputs that have specific property from
those that are far from having the property
• Benefits:
– May be the natural question to ask
– May be just as good when data constantly changing
– Gives fast sanity check to rule out very “bad” inputs (i.e.,
restaurant bills) or to decide when expensive processing is
worth it
Transactions of 20-30 yr olds Transactions of 30-40 yr olds
trend change?
Trend change analysis
Outbreak of diseases
• Do two diseases follow similar patterns?
• Are they correlated with income level or zip code?
• Are they more prevalent near certain areas?
Is the lottery uniform?
• New Jersey Pick-k Lottery (k =3,4)
– Pick k digits in order.
– 10k possible values.
• Are all values equally likely to occur ?
Global statistical properties:
• Decisions based on samples of distribution
• Properties: similarities, correlations,
information content, distribution of data,…
Another Example
Pattern matching on Strings
• Are two strings similar or not? (number of
deletions/insertions to change one into the other)
– Text
– Website content
– DNA sequences
ACTGCTGTACTGACT (length 15)
CATCTGTATTGAT (length 13)
match size =11
Pattern matching on Strings
• Previous algorithms using classical techniques
for computing edit distance on strings of size n
use at least n2 time
– For strings of size 1000, this is 1,000,000
– Can you compute edit distance in near linear time ?
– Can you test whether two strings have edit distance
below c (small) or above c’ (big) in sub-linear time ?
Pattern matching on Strings
• Previous algorithms using classical techniques
for computing edit distance on strings of size n
use at least n2 time
– For strings of size 1000, this is 1,000,000
– Can we compute edit distance in near linear time ?
• Various methods, key is metric embedding
– Can we test whether two strings have edit distance
below c (small) or above c’ (big) in sub-linear time ?
• Property testing
Metric Embedding
• Metric space is a pair (X, d) where X is a set and d: X x X*0,∞+ is a
metric satisfying
i.) d(x,y)=0 if and only if x=y
ii.) d(x,y)=d(y,x)
iii.) d(x,y)+d(y,z) >= d(x,z)
Dissimilarity matrix for bacterial
strain in microbiology
x1 x2 xn
x1
x2
xn
f: X  R2
d(xa, xb)=d(f(xa), f(xb))
1.) succinct representation
2.) easy to understand structure
3.) efficient algorithms
Metric Embedding
• Metric space is a pair (X, d) where X is a set and d: X x X*0,∞+ is a
metric satisfying
i.) d(x,y)=0 if and only if x=y
ii.) d(x,y)=d(y,x)
iii.) d(x,y)+d(y,z) >= d(x,z)
Dissimilarity matrix for bacterial
strain in microbiology
x1 x2 xn
x1
x2
xn
f: X  R2
d(xa, xb)=d(f(xa), f(xb))
1.) succinct representation
2.) easy to understand structure
3.) efficient algorithms
Isometric Embedding
Metric Embedding
• C-embedding: Embedding with distortion
• Example
– General n-point metric  Tree metric C=O(log n)
– Specific metrics  normed space : edit distance (Levenshtein distance to low
dimensional l1)
– Graphs  t-spanner C=t
– High dimensional spaces  low dimensional spaces : Johnson-Lindenstrauss
theorem: flattening in l2 ,C=(1+epsilon) [Dimensionality Reduction]
f: X  X’, (X,d) and (X,d’) are metric
∃ r ∊ (0,∞ ) s.t r d(xa, xb) <=d(f(xa), f(xb))<= C r d(xa, xb)
Distortion=C
Sparse Transformation
• Sparse Fourier Transform
• Sparse Johnson-Lindenstrauss
– Fast dimensionality reduction
• Discrete Fourier Transform:
– Given: a signal x*1…n]
– Goal: compute the frequency
vector x’ where
x’f = Σt xt e-2πi tf/n
• Very useful tool:
– Compression (audio, image, video)
– Data analysis
– Feature extraction
– …
• See SIGMOD’04 tutorial
“Indexing and Mining Streams”
by C. Faloutsos
Fourier Transform
Sampled Audio Data (Time)
DFT of Audio Samples (Frequency)
Computing DFT
• Fast Fourier Transform (FFT) computes
the frequencies in time O(n log n)
• But, we can do (much) better if we only
care about small number k of “dominant
frequencies”
– E.g., recover assume it is k-sparse (only k
non-zero entries)
• Exactly k-sparse signals: O(klog n)
• Approx. k-sparse signals* : O(klog n * log(n/k))
Agenda
• Introduction to Data Streaming Model
• Finding Frequent Items Deterministically
• Lower bound for deterministic computation for distinct items
• Interlude to concentration inequality
• Analysis of counting distinct items algorithms

More Related Content

What's hot

Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
Data science : R Basics Harvard University
Data science : R Basics Harvard UniversityData science : R Basics Harvard University
Data science : R Basics Harvard UniversityMrMoliya
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCAapo Kyrölä
 
Apache con big data 2015 magellan
Apache con big data 2015 magellanApache con big data 2015 magellan
Apache con big data 2015 magellanRam Sriharsha
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTDavide Gallitelli
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 

What's hot (11)

Data Structures & Algorithms
Data Structures & AlgorithmsData Structures & Algorithms
Data Structures & Algorithms
 
Lecture23
Lecture23Lecture23
Lecture23
 
Latex intro
Latex introLatex intro
Latex intro
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Data science : R Basics Harvard University
Data science : R Basics Harvard UniversityData science : R Basics Harvard University
Data science : R Basics Harvard University
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
 
Apache con big data 2015 magellan
Apache con big data 2015 magellanApache con big data 2015 magellan
Apache con big data 2015 magellan
 
Mining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDTMining high speed data streams: Hoeffding and VFDT
Mining high speed data streams: Hoeffding and VFDT
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Isam
IsamIsam
Isam
 

Viewers also liked

Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopMark Johnson
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataMarko Rodriguez
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraMatthias Broecheler
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Big Data
Big DataBig Data
Big DataNGDATA
 

Viewers also liked (20)

Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
 
Pre processing big data
Pre processing big dataPre processing big data
Pre processing big data
 
Big Data
Big DataBig Data
Big Data
 
Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Big data Overview
Big data OverviewBig data Overview
Big data Overview
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big Data
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big Data
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with Cassandra
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Big Data
Big DataBig Data
Big Data
 
Introduction to R for Data Mining
Introduction to R for Data MiningIntroduction to R for Data Mining
Introduction to R for Data Mining
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 

Similar to Algorithmic techniques-for-big-data-analysis

Algorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem SolvingAlgorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem Solvingcoolpie
 
Approximate methods for scalable data mining (long version)
Approximate methods for scalable data mining (long version)Approximate methods for scalable data mining (long version)
Approximate methods for scalable data mining (long version)Andrew Clegg
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.pptchatbot9
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Konstantin V. Shvachko
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Databricks
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptxShree Shree
 

Similar to Algorithmic techniques-for-big-data-analysis (20)

14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Lecture 1 (bce-7)
Lecture   1 (bce-7)Lecture   1 (bce-7)
Lecture 1 (bce-7)
 
Algorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem SolvingAlgorithm and Data Structures - Basic of IT Problem Solving
Algorithm and Data Structures - Basic of IT Problem Solving
 
Enar short course
Enar short courseEnar short course
Enar short course
 
Approximate methods for scalable data mining (long version)
Approximate methods for scalable data mining (long version)Approximate methods for scalable data mining (long version)
Approximate methods for scalable data mining (long version)
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
data clean.ppt
data clean.pptdata clean.ppt
data clean.ppt
 
DynamodbDB Deep Dive
DynamodbDB Deep DiveDynamodbDB Deep Dive
DynamodbDB Deep Dive
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 

More from Atner Yegorov

Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015Atner Yegorov
 
Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015Atner Yegorov
 
Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015Atner Yegorov
 
Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015Atner Yegorov
 
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015Atner Yegorov
 
Брендинг Чувашии
Брендинг ЧувашииБрендинг Чувашии
Брендинг ЧувашииAtner Yegorov
 
VIII Экономический форум Чебоксары
VIII Экономический форум ЧебоксарыVIII Экономический форум Чебоксары
VIII Экономический форум ЧебоксарыAtner Yegorov
 
Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015Atner Yegorov
 
Аттракционы тематического парка, Татарстан
Аттракционы тематического парка, ТатарстанАттракционы тематического парка, Татарстан
Аттракционы тематического парка, ТатарстанAtner Yegorov
 
Тематический парк, Татарстан
Тематический парк, ТатарстанТематический парк, Татарстан
Тематический парк, ТатарстанAtner Yegorov
 
Будущее журналистики
Будущее журналистикиБудущее журналистики
Будущее журналистикиAtner Yegorov
 
Будущее нишевых СМИ
Будущее нишевых СМИБудущее нишевых СМИ
Будущее нишевых СМИAtner Yegorov
 
Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах Atner Yegorov
 
Война форматов. Как люди читают новости
Война форматов. Как люди читают новости Война форматов. Как люди читают новости
Война форматов. Как люди читают новости Atner Yegorov
 
Национальная технологическая инициатива
Национальная технологическая инициативаНациональная технологическая инициатива
Национальная технологическая инициативаAtner Yegorov
 
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ Atner Yegorov
 
Участники заседания по модернизации
Участники заседания по модернизацииУчастники заседания по модернизации
Участники заседания по модернизацииAtner Yegorov
 
Город Innopolis
Город InnopolisГород Innopolis
Город InnopolisAtner Yegorov
 
презентация1
презентация1презентация1
презентация1Atner Yegorov
 

More from Atner Yegorov (20)

Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015Здравоохранение Чувашии 2015
Здравоохранение Чувашии 2015
 
Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015Чувашия. Итоги за 5 лет. 2015
Чувашия. Итоги за 5 лет. 2015
 
Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015Инвестиционная политика Чувашской Республики 2015
Инвестиционная политика Чувашской Республики 2015
 
Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015Господдержка малого и среднего бизнеса в Чувашии 2015
Господдержка малого и среднего бизнеса в Чувашии 2015
 
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
ИННОВАЦИИ В ПРОМЫШЛЕННОСТИ ЧУВАШИИ. 2015
 
Брендинг Чувашии
Брендинг ЧувашииБрендинг Чувашии
Брендинг Чувашии
 
VIII Экономический форум Чебоксары
VIII Экономический форум ЧебоксарыVIII Экономический форум Чебоксары
VIII Экономический форум Чебоксары
 
Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015Рейтин инвестклимата в регионах 2015
Рейтин инвестклимата в регионах 2015
 
Аттракционы тематического парка, Татарстан
Аттракционы тематического парка, ТатарстанАттракционы тематического парка, Татарстан
Аттракционы тематического парка, Татарстан
 
Тематический парк, Татарстан
Тематический парк, ТатарстанТематический парк, Татарстан
Тематический парк, Татарстан
 
Будущее журналистики
Будущее журналистикиБудущее журналистики
Будущее журналистики
 
Будущее нишевых СМИ
Будущее нишевых СМИБудущее нишевых СМИ
Будущее нишевых СМИ
 
Relap
RelapRelap
Relap
 
Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах Будущее онлайн СМИ в регионах
Будущее онлайн СМИ в регионах
 
Война форматов. Как люди читают новости
Война форматов. Как люди читают новости Война форматов. Как люди читают новости
Война форматов. Как люди читают новости
 
Национальная технологическая инициатива
Национальная технологическая инициативаНациональная технологическая инициатива
Национальная технологическая инициатива
 
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ  ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
О РАЗРАБОТКЕ И РЕАЛИЗАЦИИ НАЦИОНАЛЬНОЙ ТЕХНОЛОГИЧЕСКОЙ ИНИЦИАТИВЫ
 
Участники заседания по модернизации
Участники заседания по модернизацииУчастники заседания по модернизации
Участники заседания по модернизации
 
Город Innopolis
Город InnopolisГород Innopolis
Город Innopolis
 
презентация1
презентация1презентация1
презентация1
 

Recently uploaded

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Recently uploaded (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

Algorithmic techniques-for-big-data-analysis

  • 1. Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research
  • 2. Need to ensure quality of data Challenges of Big Data •VOLUME —Large amount of data •VELOCITY —Needs to be analyzed quickly •VARIETY —Different types of structured and unstructured data •VERACITY —Low quality data, inconsistencies
  • 3. This course • Develop algorithms to deal with such data – Emphasis on different models for processing data – Common techniques and fundamental paradigm • Style: algorithmic/theoretical – Background in basic algorithms and probability • However – Important problems and practical solutions (mostly) – Applied projects are OK and encouraged (more discussions to follow)
  • 4. Grading • Scribing once: 20% • Participation (class and blog): 20% – http://csci8980bigdataalgo.wordpress.com/: all information, lecture notes will be posted there – Add references – Add datasets – Write posts on interesting articles – Discussquestions from the lecture • Project: 60% – Survey: 30% • Read ~5 papers (~2 for background+~3 recent developments) • Out of class presentation/discussion – Project presentation+ Write-up: 30% • Goal to have a project worthy of publication in a good conference in theory/ data bases/ data mining. • NO EXAMS
  • 5. Timeline • Scribe notes – Due before next class. Template will be provided. • Oct 3rd : submit names of 5 survey papers – Encouraged to consult with me before • Oct 17th: project proposal due – Max2 pages + references – Must be in latex, single column, article 11 pt. • Nov 14th : a one page write-up describing progress on project due • Dec 5th: project final write-up due – At most 10 pages in latex + references+ appendix (if required to include omitted proofs/experiments) – Introduction with motivation + related works+ technical part + experimental section
  • 6. Office Hours • By appointment: Friday 1pm to 5pm – Send email to fix a time • Email: barna@research.att.com, barna.cs@gmail.com • Where: 6-198 KHKH • Encouraged to discuss the progress on projects with me throughout the semester
  • 7. Tentative Syllabus • Models – Small memory algorithms – External memory algorithms – Distributed algorithms (Map Reduce) – Crowdsourcing (People assisted computing) • Focus: Efficiency vs Quality – Near linear time algorithm design – Incremental and update efficient algorithm design – Sub-linear algorithm design / property testing – Sparse transformation – Dimensionality reduction – Metric embedding
  • 8. Tentative Course Plan • Sept 5th Overview, Introduction to Data Streaming • Sept 12th • Sept 19th • Sept 26th • Oct 3rd Semi-streaming and External Memory Algorithms • Oct 10th • Oct 17th • Oct 24th Property testing • Oct 31st Sparse transformation or near-linear time algorithm design • Nov 7th Metric embedding • Nov 14th Crowdsourcing • Nov 21st Project Presentation • Nov 28th Thanksgiving break • Dec 5th Project Presentation Streaming: Sketching + Sampling, Dimensionality Reduction Map Reduce Algorithms
  • 9. Plan for this lecture • 1st half: Overview of the course topics • 2nd half: Introduction to Data Streaming
  • 10. Models • Different models need different algorithms for the same problem – Default: Main Memory Model – External Memory Model – Streaming Model – MapReduce – Crowdsourcing 1. Do you have enough main memory ? 2. How much disk I/O are you performing ? 3. Is your data changing fast ? 4. Can you distribute your data to multiple servers for fast processing ? 5. Is your data ambiguous that it needs human power to process ?
  • 11. Counting Distinct Elements Given a sequence A= a1, a2, …, am where ai ∊ {1…n}, compute the number of distinct elements in A (denoted by |A|). • Natural and popular statistics, eg. – Given the list of transactions, compute the number of different customers (i.e. credit card numbers) – What is the size of the web vocabulary ? Example: 4 5 5 1 7 6 1 2 4 4 4 3 6 6 distinct elements=7
  • 12. Counting Distinct Elements • Default model: Random Access Main Memory Model • Maintain an array of size n: B*1,…,n+—initially set to all “0” • If item “i” arrives set B*i]=1 • Count the number of “1”s in B
  • 13. Counting Distinct Elements • Default model: Random Access Memory Model • Maintain an array of size n: B*1,…,n+—initially set to all “0” • If item “i” arrives set B*i]=1 • Count the number of “1”s in B  O(m) running time  Requires random access to B ( ? )  Requires space n even though the number of distinct elements is small or m < n –domain may be much larger
  • 14. Counting Distinct Elements • Default model: Random Access Memory Model • Initialize count=0, an array of lists B*1….O(m)+ and a hash function h : ,1….n-  ,1…O(m)- • For each ai – Compute j=h(ai) – Check if ai occurs in the list pointed to by B[j] – If not, count=count+1 and add ai to the list • Return count Assuming that h(.) is random enough, running time is O(m), space usage O(m). PROVE IT ! Space is still O(m) Random access to B for each input
  • 15. Counting Distinct Elements • External Memory Model – M units of main memory – Input size m, m >> M – Data is stored on disk: • Space divided into blocks, each of size B <=M • Transferring one block of data into the main memory takes unit time – Main memory operations for free but disk I/O is costly – Goal is to reduce number of disk I/O
  • 16. Distinct Elements in External Memory • Sorting in external memory • External Merge sort – Split the data into M/B segments – Recursively sort each segment – Merge the segments using m/B block accesses Example: M/B=3 No of disk I/O/merging= m/B No of recursion call=logM/B m Total sorting time=m/B logM/B m
  • 17. Distinct Elements in External Memory • Sorting in external memory • External Merge sort – Split the data into M/B segments – Recursively sort each segment – Merge the segments using m/B block accesses Example: M/B=3 No of disk I/O/merging= m/B No of recursion call=logM/B m Total sorting time=m/B logM/B m 4 5 5 1 7 6 1 2 4 4 4 3 6 6 1 1 2 3 4 4 4 4 5 5 6 6 6 7 Count=1 For j=2,…,m If aj > aj-1 count=count+1
  • 18. Distinct Elements in Streaming Model • Streaming Model – Data comes in streaming fashion one at a time (suppose from CD-ROM or cash-register) – M units of main memory, M << m – Only one pass over data • Data not stored is lost
  • 19. Distinct Elements in Streaming Model • Suppose you want to know if the number of distinct elements is at least “t” • Initialize a hash function h:,1,…n-  ,1,…,t- • Initialize the answer to NO • For each ai: – If h(ai) ==1, then set the answer to YES The algorithm uses only 1 bit of storage ! (not counting the random bits for h)
  • 20. Distinct Elements in Streaming Model • Suppose you want to know if the number of distinct elements is at least “t” • Initialize a hash function h:,1,…m-  ,1,…,t- • Initialize the answer to NO, count=0 • For each ai: – If h(ai) ==1, then count++ (this run returns YES) • Repeat the above procedure for log n different hash functions from the family – Set YES if count > log n (1-1/e) [Boosting the confidence] The algorithm uses log n bit of storage ! (not countingthe random bits for h) Run log(n) algorithmsin parallel using t=2,4,8,…n Approximateanswerswith high probability > 1-1/n Space usage O(log 2n)
  • 21. Distinct Elements in Streaming Model • Suppose you want to know if the number of distinct elements is at least “t” • Initialize a hash function h:,1,…m-  ,1,…,t- • Initialize the answer to NO, count=0 • For each ai: – If h(ai) ==1, then count++ (this run returns YES) • Repeat the above procedure for log n different hash functions from the family – Set YES if count > log n (1-1/e) [Boosting the confidence] The algorithm uses log n bit of storage ! (not countingthe random bits for h) Run log(n) algorithmsin parallel using t=2,4,8,…n Approximateanswerswith high probability > 1-1/n Space usage O(log 2n) Approximation and Randomization are essential !
  • 22. MapReduce Model • Hardware is relatively cheap • Plenty of parallel algorithms designed but – Parallel programming is hard • Threaded programs are difficult to test, debug, synchronization issues, more machines mean more breakdown • MapReduce makes parallel programming easy
  • 23. MapReduce Model • MapReduce makes parallel programming easy – Tracks the jobs and restarts if needed – Takes care of data distribution and synchronization • But there is no free lunch: – Imposes a structure on the data – Only allows for certain kind of parallelism
  • 24. MapReduce Model • Data: – Represented as <Key, Value> pairs • Map: – Data  List < Key, Value> [programmer specified] • Shuffle: – Aggregate all pairs with the same key [handled by system] • Reduce: – <Key, List(Value)> <Key, List(Value)> [programmer specified]
  • 25. Distinct Elements in MapReduce • r servers • Data – [1,a1], [2,a2+,…., *n,am] • Map – [1,a1], [2,a2+,…., *n,am][1,a1], [1,a2+,…*1,am/r] ,[2,am/r+1+,…., *2,a2m/r+,….,*r,am] • Reduce – Reducer 1: [1,a1], [1,a2+,…*1,am/r]  [1,a1], [1,a2+,…*1,am/r] ,[1,h()] generates the hash function) – Reducer 2: [2,am/r+1], [2,am/r+2+,…*2,a2m/r] [2,am/r+1], [2,am/r+2+,…*2,a2m/r] – … • Map [1,a1], [1,a2+,…*1,am/r] ,[2,am/r+1+,…., *2,a2m/r+,….,*r,am],[1,h()] [1,a1], [1,a2+,…*1,am/r] ,[2,am/r+1+,…., *2,a2m/r+,….,*r,am+, *1,h()+, *2,h()+, …., *r,h()] makes multiple copies of the hash function for distribution • Reduce – Reducer 1: [1,a1], [1,a2+,…*1,aN/r] ,[1,h()], create sketch B1, outputs [1,B1] – …… • Map – [1,B1], [2,B2+,….,*r,Br] [1,B1], [1,B2+,….,*1,Br] gathers all the sketches • Reduce – Reducer1: [1,B1], [1,B2+,….,*1,Br], computes B= B1+B2+…..+Br, , Follows the Streaming Algorithm to compute distinct elements from the sketch
  • 26. Crowdsourcing • Incorporating human power for data gathering and computing • People still outperform state-of-the-art algorithms for many data intensive tasks – Typically involve ambiguity, deep understanding of language or context or subjective reasoning
  • 27. Distinct Elements by Crowdsourcing • Ask for each pair if they are equal • Create a graph with each element as node • Add an edge between two nodes if the corresponding pairs are returned to be equal • Return number of connected components • Also known as record linkage, entity resolution, deduplication
  • 28. Distinct Elements by Crowdsourcing Distinct elements=4
  • 29. Distinct Elements by Crowdsourcing Distinct elements=4 Too many questions to crowd ! Costly. Can we reduce the number of questions ?
  • 30. Scalable Algorithm Design Is linear time good enough ? – Don’t even read the entire data and return result !! – Hope: do not require exact answer Near linear time algorithm design Seemingly best one can do since reading data needs linear time
  • 31. Scalable Algorithm Design Is linear time good enough ? – Don’t even read the entire data and return result !! – Hope: do not require exact answer Near linear time algorithm design Seemingly best one can do since reading data needs linear time Property Testing
  • 32. “In the ballpark” vs. “out of the ballpark” tests • Distinguish inputs that have specific property from those that are far from having the property • Benefits: – May be the natural question to ask – May be just as good when data constantly changing – Gives fast sanity check to rule out very “bad” inputs (i.e., restaurant bills) or to decide when expensive processing is worth it
  • 33. Transactions of 20-30 yr olds Transactions of 30-40 yr olds trend change? Trend change analysis
  • 34. Outbreak of diseases • Do two diseases follow similar patterns? • Are they correlated with income level or zip code? • Are they more prevalent near certain areas?
  • 35. Is the lottery uniform? • New Jersey Pick-k Lottery (k =3,4) – Pick k digits in order. – 10k possible values. • Are all values equally likely to occur ?
  • 36. Global statistical properties: • Decisions based on samples of distribution • Properties: similarities, correlations, information content, distribution of data,…
  • 37. Another Example Pattern matching on Strings • Are two strings similar or not? (number of deletions/insertions to change one into the other) – Text – Website content – DNA sequences ACTGCTGTACTGACT (length 15) CATCTGTATTGAT (length 13) match size =11
  • 38. Pattern matching on Strings • Previous algorithms using classical techniques for computing edit distance on strings of size n use at least n2 time – For strings of size 1000, this is 1,000,000 – Can you compute edit distance in near linear time ? – Can you test whether two strings have edit distance below c (small) or above c’ (big) in sub-linear time ?
  • 39. Pattern matching on Strings • Previous algorithms using classical techniques for computing edit distance on strings of size n use at least n2 time – For strings of size 1000, this is 1,000,000 – Can we compute edit distance in near linear time ? • Various methods, key is metric embedding – Can we test whether two strings have edit distance below c (small) or above c’ (big) in sub-linear time ? • Property testing
  • 40. Metric Embedding • Metric space is a pair (X, d) where X is a set and d: X x X*0,∞+ is a metric satisfying i.) d(x,y)=0 if and only if x=y ii.) d(x,y)=d(y,x) iii.) d(x,y)+d(y,z) >= d(x,z) Dissimilarity matrix for bacterial strain in microbiology x1 x2 xn x1 x2 xn f: X  R2 d(xa, xb)=d(f(xa), f(xb)) 1.) succinct representation 2.) easy to understand structure 3.) efficient algorithms
  • 41. Metric Embedding • Metric space is a pair (X, d) where X is a set and d: X x X*0,∞+ is a metric satisfying i.) d(x,y)=0 if and only if x=y ii.) d(x,y)=d(y,x) iii.) d(x,y)+d(y,z) >= d(x,z) Dissimilarity matrix for bacterial strain in microbiology x1 x2 xn x1 x2 xn f: X  R2 d(xa, xb)=d(f(xa), f(xb)) 1.) succinct representation 2.) easy to understand structure 3.) efficient algorithms Isometric Embedding
  • 42. Metric Embedding • C-embedding: Embedding with distortion • Example – General n-point metric  Tree metric C=O(log n) – Specific metrics  normed space : edit distance (Levenshtein distance to low dimensional l1) – Graphs  t-spanner C=t – High dimensional spaces  low dimensional spaces : Johnson-Lindenstrauss theorem: flattening in l2 ,C=(1+epsilon) [Dimensionality Reduction] f: X  X’, (X,d) and (X,d’) are metric ∃ r ∊ (0,∞ ) s.t r d(xa, xb) <=d(f(xa), f(xb))<= C r d(xa, xb) Distortion=C
  • 43. Sparse Transformation • Sparse Fourier Transform • Sparse Johnson-Lindenstrauss – Fast dimensionality reduction
  • 44. • Discrete Fourier Transform: – Given: a signal x*1…n] – Goal: compute the frequency vector x’ where x’f = Σt xt e-2πi tf/n • Very useful tool: – Compression (audio, image, video) – Data analysis – Feature extraction – … • See SIGMOD’04 tutorial “Indexing and Mining Streams” by C. Faloutsos Fourier Transform Sampled Audio Data (Time) DFT of Audio Samples (Frequency)
  • 45. Computing DFT • Fast Fourier Transform (FFT) computes the frequencies in time O(n log n) • But, we can do (much) better if we only care about small number k of “dominant frequencies” – E.g., recover assume it is k-sparse (only k non-zero entries) • Exactly k-sparse signals: O(klog n) • Approx. k-sparse signals* : O(klog n * log(n/k))
  • 46. Agenda • Introduction to Data Streaming Model • Finding Frequent Items Deterministically • Lower bound for deterministic computation for distinct items • Interlude to concentration inequality • Analysis of counting distinct items algorithms