F15CS194Lec03TabularData.pptx

Introduction to Data Science
Lecture 3
Manipulating Tabular Data
Intro. to Data Science Fall 2015
John Canny
including notes from Michael Franklin and
others

Outline for this Evening
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures

Data Science – One Definition

The Big Picture
4
Extract
Transform
Load

Key Concept: Structured Data
A data model is a collection of concepts for
describing data.
A schema is a description of a particular
collection of data, using a given data model.

The Relational Model*
• The Relational Model is Ubiquitous:
• MySQL, PostgreSQL, Oracle, DB2, SQLServer, …
• Foundational work done at
• IBM - System R
• UC Berkeley - Ingres
• Object-oriented concepts have been merged in
• Early work: POSTGRES research project at Berkeley
• Informix, IBM DB2, Oracle 8i
• Also has support for XML (semi-structured data)
*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37
E. F., “Ted” Codd
Turing Award 1981

Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
Schema : specifies name of relation, plus name and
type of each column
Students(sid: string, name: string, login: string,
age: integer, gpa: real)
Instance : the actual data at a given time
• #rows = cardinality
• #fields = degree / arity
• A relation is a mathematical object (from set theory)
which is true for certain arguments.
• An instance defines the set of arguments for which the
relation is true (it’s a table not a row).

Ex: Instance of Students Relation
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith @math 19 3.8
• Cardinality = 3, arity = 5 , all rows distinct
• The relation is true for these tuples and false for others

SQL - A language for Relational DBs*
• SQL = Structured Query Language
• Data Definition Language (DDL)
– create, modify, delete relations
– specify constraints
– administer users, security, etc.
• Data Manipulation Language (DML)
– Specify queries to find tuples that satisfy criteria
– add, modify, remove tuples
• The DBMS is responsible for efficient evaluation.
* Developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the 1970s.
Used to be SEQUEL (Structured English QUEry Language)

Creating Relations in SQL
• Create the Students relation.
– Note: the type (domain) of each field is specified,
and enforced by the DBMS whenever tuples are
added or modified.
CREATE TABLE Students
(sid CHAR(20),
name CHAR(20),
login CHAR(10),
age INTEGER,
gpa FLOAT)

Table Creation (continued)
• Another example: the Enrolled table holds
information about courses students take.
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2))

Adding and Deleting Tuples
• Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)
• Can delete all tuples satisfying some condition (e.g.,
name = Smith):
DELETE
FROM Students S
WHERE S.name = 'Smith'

Queries in SQL
• Single-table queries are straightforward.
• To find all 18 year old students, we can write:
SELECT *
FROM Students S
WHERE S.age=18
• To find just names and logins, replace the first line:
SELECT S.name, S.login

Joins and Inference
• Chaining relations together is the basic inference
method in relational DBs. It produces new
relations (effectively new facts) from the data:
SELECT S.name, M.mortality
FROM Students S, Mortality M
WHERE S.Race=M.Race
Name Race
Socrates Man
Thor God
Barney Dinosaur
Blarney stone Stone
Race Mortality
Man Mortal
God Immortal
Dinosaur Mortal
Stone Non-living
M
S

Joins and Inference
• Chaining relations together is the basic inference
method in relational DBs. It produces new
relations (effectively new facts) from the data:
SELECT S.name, M.mortality
FROM Students S, Mortality M
WHERE S.Race=M.Race
Name Mortality
Socrates Mortal
Thor Immortal
Barney Mortal
Blarney stone Non-living

Basic SQL Query
• relation-list : A list of relation names
• possibly with a range-variable after each name
• target-list : A list of attributes of tables in relation-list
• qualification : Comparisons combined using AND, OR and
NOT.
• Comparisons are Attr op const or Attr1 op Attr2, where
op is one of =≠<>≤≥
• DISTINCT: optional keyword indicating that the
answer should not contain duplicates.
• In SQL SELECT, the default is that duplicates are not
eliminated! (Result is called a “multiset”)
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification

SQL Inner Joins
SELECT S.name, E.classid
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
11111 DataScience194
22222 French150
44444 English10
S.name E.classid
Jones History105
Jones DataScience194
Smith French150
E
S
Note the previous version of this query (with no join keyword) is an “Implicit join”

SQL Inner Joins
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
E
S
Unmatched keys

What kind of Join is this?
FROM Students S ?? Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
Brown NULL
E
S

SQL Joins
FROM Students S LEFT OUTER JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
Brown NULL
E
S

ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
NULL English10
E
S

SQL Joins
FROM Students S RIGHT OUTER JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
NULL English10
E
S

SQL Joins
FROM Students S ? JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
NULL English10
Brown NULL
E
S

SQL Joins
FROM Students S FULL OUTER JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
NULL English10
Brown NULL
E
S

ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
E
S

SQL Joins
FROM Students S LEFT SEMI JOIN Enrolled E
ON S.sid=E.sid
S.name S.sid
Jones 11111
Smith 22222
Brown 33333
E.sid E.classid
11111 History105
22222 French150
44444 English10
S.name E.classid
Jones History105
Smith French150
E
S

SELECT *
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
22222 French150
S.name S.sid E.sid E.classid
Jones 11111 11111 History105
Jones 11111 11111 DataScience194
Jones 11111 22222 French150
Smith 22222 11111 History105
Smith 22222 11111 DataScience194
Smith 22222 22222 French150
E
S

SQL Joins
SELECT *
FROM Students S CROSS JOIN Enrolled E
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
22222 French150
Smith 22222 11111 History105
Smith 22222 11111 DataScience194
E
S

SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
22222 French150
E
S

Theta Joins
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid
S.name S.sid
Jones 11111
Smith 22222
E.sid E.classid
11111 History105
22222 French150
E
S

Recall: Tweet JSON Format
{"id"=>12296272736,
"text"=>
"An early look at Annotations:
http://groups.google.com/group/twitter-api-announce/browse_thread/thread/fa5da2608865453",
"created_at"=>"Fri Apr 16 17:55:46 +0000 2010",
"in_reply_to_user_id"=>nil,
"in_reply_to_screen_name"=>nil,
"in_reply_to_status_id"=>nil
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"screen_name"=>"twitterapi",
"name"=>"Twitter API",
"description"=>
"The Real Twitter API. I tweet about API changes, service issues and
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
The tweet's unique ID. These
IDs are roughly sorted &
developers should treat them
as opaque (http://bit.ly/dCkppc).
Text of the tweet.
Consecutive duplicate tweets
are rejected. 140 character
max (http://bit.ly/4ud3he).
Tweet's
creation
date.
DEPRECATED
The ID of an existing tweet that
this tweet is in reply to. Won't
be set unless the author of the
referenced tweet is mentioned.
The screen name &
user ID of replied to
tweet author.
Truncated to 140
characters. Only
possible from SMS.
The
author
of
the
tweet.
This
embedded
object
can
get
out
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.
The author's
biography.
The author's
URL.
The author's "location". This is a free-form text field, and
there are no guarantees on whether it can be geocoded.
Rendering information
for the author. Colors
are encoded in hex
values (RGB).
The creation date
for this account.
Whether this account has
contributors enabled
(http://bit.ly/50npuu). Number of
favorites this
user has.
Number
of
tweets
this
user
has.
Number of
users this user
is following.
The timezone and offset
(in seconds) for this user.
The user's selected
language.
Whether this user is protected
or not. If the user is protected,
then this tweet is not visible
except to "friends".
Number of
followers for
this user.
Whether
this
user
has
geo
enabled
(http://bit.ly/4pFY77).
DEPRECATED
in this context
Whether this user
has a verified badge.
The
geo
tag
on
this
tweet
in
GeoJSON
(http://bit.ly/b8L1Cp).
The contributors' (if any) user
IDs (http://bit.ly/50npuu).
DEPRECATED
The place associated with this
Tweet (http://bit.ly/b8L1Cp).
The place ID
The URL to fetch a detailed
polygon for this place
The printable names of this place
The type of this
place - can be a
"neighborhood"
or "city"
The country this place is in
The bounding
box for this
place
The application
that sent this
tweet
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>
18 April 2010
{"id"=>12296272736,
"text"=>
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"description"=>
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.
Text of the tweet.
DEPRECATED
The screen name &
tweet author.
Truncated to 140
characters. Only
possible from SMS.
.
This
ut
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.

Normalization
Raw twitter data storage is very inefficient because, e.g. user
records are repeated with every tweet by that user.
Normalization is the process of minimizing data redundancy.
User.id Name Attribs…
111 Jones
222 Smith
Loc.id Name Attribs…
1111 Berkeley
2222 Oakland
3333 Hayward
Tweet id User id Location id Body
11 111 1111 I need a Jamba juice
22 111 1111 Cal Soccer rules
33 111 2222 Why do we procrastinate?
44 222 3333 Close your eyes and push “go”

Normalization
Normalized tables include only a foreign key to the information in
another table for repeated data.
The original table is the result of inner joins between tables.
User.id Name Attribs…
111 Jones
222 Smith
Loc.id Name Attribs…
1111 Berkeley
2222 Oakland
3333 Hayward
11 111 1111 I need a Jamba juice

Aggregate Queries
Including reference counts in the lookup tables allows you to
perform aggregate queries on those tables alone:
Average age of users, most popular location,…
U.id Name Count Attr..
111 Jones 3
222 Smith 1
L.id Name Count Attr…
1111 Berkeley 2
2222 Oakland 1
3333 Hayward 1
11 111 1111 I need a Jamba Juice

SQL Query Semantics
Semantics of an SQL query are defined in terms of the
following conceptual evaluation strategy:
1. do FROM clause: compute cross-product of tables (e.g.,
Students and Enrolled).
2. do WHERE clause: Check conditions, discard tuples
that fail. (i.e., “selection”).
3. do SELECT clause: Delete unwanted fields. (i.e.,
“projection”).
4. If DISTINCT specified, eliminate duplicate rows.
Probably the least efficient way to compute a query!
– An optimizer will find more efficient strategies to get
the same answer.

38
Data Model (Tabular)
• SQLite
– Table: fixed number of named columns of specified
type
– 5 storage classes for columns
• NULL
• INTEGER
• REAL
• TEXT
• BLOB
– Data stored on disk in a single file in row-major order
– Operations performed via sqlite3 shell

39
Reductions and GroupBy
• One of the most common operations on Data Tables is
aggregation or reduction (count, sum, average, min, max,…).
• They provide a means to see high-level patterns in the data,
to make summaries of it etc.
• You need ways of specifying which columns are being
aggregated over, which is the role of a GroupBy operator.
SID Name Course Semester Grade GPA
111 Jones Stat 134 F13 A 4.0
111 Jones CS 162 F13 B- 2.7
222 Smith EE 141 S14 B+ 3.3
222 Smith CS162 F14 C+ 2.3
222 Smith CS189 F14 A- 3.7

40
111 Jones Stat 134 F13 A 4.0
111 Jones CS 162 F13 B- 2.7
222 Smith EE 141 S14 B+ 3.3
222 Smith CS162 F14 C+ 2.3
222 Smith CS189 F14 A- 3.7
SELECT SID, Name, AVG(GPA)
FROM Students
GROUP BY SID

41
111 Jones Stat 134 F13 A 4.0
111 Jones CS 162 F13 B- 2.7
222 Smith EE 141 S14 B+ 3.3
222 Smith CS162 F14 C+ 2.3
222 Smith CS189 F14 A- 3.7
SELECT SID, Name, AVG(GPA)
FROM Students
GROUP BY SID
SID Name GPA
111 Jones 3.35
222 Smith 3.1

42
Pandas/Python
• Series: a named, ordered dictionary
– The keys of the dictionary are the indexes
– Built on NumPy’s ndarray
– Values can be any Numpy data type object
• DataFrame: a table with named columns
– Represented as a Dict (col_name -> series)
– Each Series object represents a column

43
Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, cartesian product
(CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join,
etc.
– rename

44
Pandas vs SQL
+ Pandas is lightweight and fast.
+ Full SQL expressiveness plus the expressiveness of
Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.
- Tables must fit into memory.
- No post-load indexing functionality: indices are built
when a table is created.
- No transactions, journaling, etc.
- Large, complex joins probably slower.

Jacobs Update
• Room 310 not ready, but other rooms are. For
the next few (?) weeks, we will meet:
• Mondays in 155 Donner
• Wednesdays in 110/120 Jacobs Hall
Starting this Weds.

The other view of tables: OLAP
• OnLine Analytical Processing
• Conceptually like an n-dimensional spreadsheet (Cube)
• (Discrete) columns become dimensions
• The goal is live interaction with numerical data for business
intelligence

The other view of tables: OLAP
From a table to a cube:
name classid Semester Grade Units
Jones History105 F13 3.3 4.0
Jones DataScience194 S12 4.0 3.0
Jones French150 F14 3.7 4.0
Smith History105 S15 2.3 3.0
Smith DataScience194 F14 2.7 3.0
Smith French150 F13 3.0 4.0

From tables to OLAP cubes
From a table to a cube:
Jones French150 F14 3.7 4.0
Smith History105 S15 2.3 3.0
Smith DataScience194 F14 2.7 3.0
Smith French150 F13 3.0 4.0
Variables used as qualifiers
(In where, GroupBy clauses)
Normally discrete
Variables we want to measure
Normally numeric

Constructing OLAP cubes
… … … … …
Name
Classid
Semester
Cell contents are Grade, Unit values
Cube
dimensions
Cube
values

Queries on OLAP cubes
• Once the cube is defined, its easy to do aggregate queries by
projecting along one or more axes.
• E.g. to get student GPAs, we project the Grade field onto the
student (Name) axis.
• In fact, such aggregates are precomputed and maintained
automatically in an OLAP cube, so queries are instantaneous.
Name
Classid
Semester
Cell contents are Grade, Unit values

OLAP
• Slicing:
fixing one or
more variables
• Dicing:
selecting a range of
values for one or
more variables

OLAP
• Drilling Up/Down
(change levels of a
hierarchically-indexed
variable)
• Pivoting:
produce a two-axis
view for viewing
as a spreadsheet.

Outline
• To support real-time querying, OLAP DBs store aggregates
of data values along many dimensions.
• This works best if axes can be tree-structured. E.g time can
be expressed as a hierarchy
hour  day  week  month  year

OLAP tradeoffs
• Aggregates increase space and the cost of updates.
• On the other hand, since they are projections of data, or
tree structures, the storage overhead can be small.
• Aggregates are limited, but cover a lot of common cases:
avg, stdev, min, max.
• Operations (slice, dice, pivot, etc.) are conceptually simpler
than SQL, but cover a lot of common cases.
• Good integration with clients, e.g. spreadsheets, for visual
interaction, although there is an underlying query
language (MDX).

Numpy/Matlab and OLAP
• Numpy and Matlab have an efficient implementation of nd-
arrays for dense data.
• Indices must be integer, but you can implement general
indices using dictionaries from indexval->int.
• Slicing and dicing are available using index ranges:
a[5,1:3,:] etc.
• Roll-down/up involve aggregates along dimensions such as
sum(a[3,4:6,:],2)
• Pivoting involves index permutations (.transpose()) and
aggregation over the other indices.
• Limitation: MATLAB and Numpy currently only support dense
nd-arrays (or sparse 2d arrays).

What’s Wrong with Tables?
• Too limited in structure?
• Too rigid?
• Too old fashioned?

What’s Wrong with (RDBMS) Tables?
• Indices: Typical RDBMS table storage is mostly indices
– Cant afford this overhead for large datastores
• Transactions:
– Safe state changes require journals etc., and are slow
• Relations:
– Checking relations adds further overhead to updates
• Sparse Data Support:
– RDBMS Tables are very wasteful when data is very sparse
– Very sparse data is common in modern data stores
– RDBMS tables might have dozens of columns, modern data
stores might have many thousands.

RDBMS tables – row based
Table:
Represented as:
sid name login age gpa
53831 Smith smith@ee 18 3.2
53831 Smith smith@ee 18 3.2

Tweet JSON Format
{"id"=>12296272736,
"text"=>
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"description"=>
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.",
"url"=>"http://apiwiki.twitter.com",
"location"=>"San Francisco, CA",
"profile_background_color"=>"c1dfee",
"profile_background_image_url"=>
"http://a3.twimg.com/profile_background_images/59931895/twitterapi-background-new.png",
"profile_background_tile"=>false,
"profile_image_url"=>"http://a3.twimg.com/profile_images/689684365/api_normal.png",
"profile_link_color"=>"0000ff",
"profile_sidebar_border_color"=>"87bc44",
"profile_sidebar_fill_color"=>"e0ff92",
"profile_text_color"=>"000000",
"created_at"=>"Wed May 23 06:01:13 +0000 2007",
"contributors_enabled"=>true,
"favourites_count"=>1,
"statuses_count"=>1628,
"friends_count"=>13,
"time_zone"=>"Pacific Time (US & Canada)",
"utc_offset"=>-28800,
"lang"=>"en",
"protected"=>false,
"followers_count"=>100581,
"geo_enabled"=>true,
"notifications"=>false,
"following"=>true,
"verified"=>true},
"contributors"=>[3191321],
"geo"=>nil,
"coordinates"=>nil,
"place"=>
{"id"=>"2b6ff8c22edd9576",
"url"=>"http://api.twitter.com/1/geo/id/2b6ff8c22edd9576.json",
"name"=>"SoMa",
"full_name"=>"SoMa, San Francisco",
"place_type"=>"neighborhood",
"country_code"=>"US",
"country"=>"The United States of America",
"bounding_box"=>
{"coordinates"=>
[[[-122.42284884, 37.76893497],
[-122.3964, 37.76893497],
[-122.3964, 37.78752897],
[-122.42284884, 37.78752897]]],
"type"=>"Polygon"}},
"source"=>"web"}
Text of the tweet.
Tweet's
creation
date.
DEPRECATED
The screen name &
tweet author.
Truncated to 140
characters. Only
possible from SMS.
The
author
of
the
tweet.
This
embedded
object
can
get
out
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.
The author's
biography.
The author's
URL.
The author's "location". This is a free-form text field, and
there are no guarantees on whether it can be geocoded.
Rendering information
for the author. Colors
are encoded in hex
values (RGB).
The creation date
for this account.
Whether this account has
contributors enabled
(http://bit.ly/50npuu). Number of
favorites this
user has.
Number
of
tweets
this
user
has.
Number of
users this user
is following.
The timezone and offset
(in seconds) for this user.
The user's selected
language.
Whether this user is protected
or not. If the user is protected,
then this tweet is not visible
except to "friends".
Number of
followers for
this user.
Whether
this
user
has
geo
enabled
(http://bit.ly/4pFY77).
DEPRECATED
in this context
Whether this user
has a verified badge.
The
geo
tag
on
this
tweet
in
GeoJSON
(http://bit.ly/b8L1Cp).
The contributors' (if any) user
IDs (http://bit.ly/50npuu).
DEPRECATED
The place associated with this
Tweet (http://bit.ly/b8L1Cp).
The place ID
The URL to fetch a detailed
polygon for this place
The printable names of this place
The type of this
place - can be a
"neighborhood"
or "city"
The country this place is in
The bounding
box for this
place
The application
that sent this
tweet
Map of a Twitter Status Object
Raffi Krikorian <raffi@twitter.com>
18 April 2010
{"id"=>12296272736,
"text"=>
"favorited"=>false,
"truncated"=>false,
"user"=>
{"id"=>6253282,
"description"=>
happily answer questions about Twitter and our API. Don't get an answer? It's on my website.
Text of the tweet.
DEPRECATED
The screen name &
tweet author.
Truncated to 140
characters. Only
possible from SMS.
.
This
ut
of
sync.
The
author's
user
ID.
The author's
user name.
The author's
screen name.

RDBMS tables – row based
Table:
Represented as:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL

Column-based store
Table:
Represented as column (key-value) stores:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs Albany 2341 38.4 122.7 100 CA
ID name
52841 Jones
53831 Smith
55541 Brown
ID login
52841 jones@cs
53831 smith@ee
55541 brown@ee
ID loc
52841 Albany
ID locid
52841 2341
ID LAT
52841 38.4
ID LONG
52841 122.7
…

Column-Family Stores (Cassandra)
A column-family groups data columns together, and is
analogous to a table (and similar to Pandas DataFrame)
Static column family from Apache Cassandra:
Dynamic Column family (Cassandra):
Can add or
remove columns
from a dynamic
column family
Columns fixed

CouchDB Data Model (JSON)
• “With CouchDB, no schema is enforced, so new document
types with new meaning can be safely added alongside
the old.”
• A CouchDB document is an object that consists of named
fields. Field values may be:
– strings, numbers, dates,
– ordered lists, associative maps
66
"Subject": "I like Plankton"
"Author": "Rusty"
"PostedDate": "5/23/2006"
"Tags": ["plankton", "baseball", "decisions"]
"Body": "I decided today that I don't like baseball. I like plankton."

Key-value stores
• A key-value store is an even simpler approach.
• It implements storage and retrieval of (key,value) pairs.
• i.e. Basic functionality is that of a dictionary
age[“john”] = 25.
• But some KV-stores also implement sorting and
indexing with the keys (e.g. leveldb).
• You can build either column-based or row-based DBs
on top of such KV-stores to optimize performance (e.g.
omitting indices or ACID qualities).
67

Pig
• Started at Yahoo! Research
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions

An Example Problem
Suppose you have user
data in one file, website
data in another, and you
need to find the top 5
most visited pages by
users aged 18-25.
Load Users Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduce

Users = load ‘users’ as (name, age);
Filtered = filter Users by
age >= 18 and age <= 25;
Pages = load ‘pages’ as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into ‘top5sites’;
In Pig Latin

Hive
• Developed at Facebook
• Relational database built on Hadoop
– Maintains table schemas
– SQL-like query language (which can also call
Hadoop Streaming scripts)
– Supports table partitioning,
complex data types, sampling,
some query optimization
• Used for most Facebook jobs
– Less than 1% of daily jobs at Facebook use
MapReduce directly!!! (SQL – or PIG – wins!)
– Note: Google also has several SQL-like systems in use.

Summary
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures
Wednesday come to 110/120 Jacobs Hall for
Pandas Lab

F15CS194Lec03TabularData.pptx

Recommended

Recommended

More Related Content

Similar to F15CS194Lec03TabularData.pptx

Similar to F15CS194Lec03TabularData.pptx (20)

Recently uploaded

Recently uploaded (20)

F15CS194Lec03TabularData.pptx