Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

NO-SQL PYTHON
Aileen Nielsen
Software Engineer, One Drop, NYC
aileen@onedrop.today
1

OUTLINE
1. WHY? (OTHER THAN THE TRENDY NAME)
2. HOW?
3. WHY? (AGAIN)
2

LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
4

SQL-LIKE, TIDY DATA
What makes this data tidy?
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
5

SQL-LIKE, TIDY DATA
What makes this data tidy?
• Observations are in rows
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
6

SQL-LIKE, TIDY DATA
• Variables are in columns
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
7

SQL-LIKE, TIDY DATA
• Contained in a single data set
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
8

SQL-LIKE, TIDY DATA
• Contained in a single data set
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
But can you tell me
anything useful
about this data set?
9

SQL-LIKE, TIDY DATA
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
Sure. These are easy
to see:
• Highest score
• Lowest score
• Total observations
10

SQL-LIKE, TIDY DATA
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
Not-so-easy
• How many people?
• Who’s doing the best?
• Who’s doing the worst?
• How are individuals doing?
11

HOW ABOUT NOW?
What Changed?
• The data’s still tidy, but we’ve changed
the organizing principle
Name Score Day
Allen 25 1
Mary 11 1
Joe 1 1
Mary 14 2
Joe 14 2
Joe 17 3
Allen 9 3
Mary 9 3
12

OK HOW ABOUT NOW?
(LAST TIME I PROMISE)
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
13

OK HOW ABOUT NOW?
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
This data’s NOT TIDY but...
14

OK HOW ABOUT NOW?
I can eyeball it easily
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
15

OK HOW ABOUT NOW?
I can eyeball it easily
And new questions become
interesting and easier to
answer
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
16

OK HOW ABOUT NOW?
• How many students are there?
• Who improved?
• Who missed a test?
• Who was kind of meh?
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
17
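A minimal sketch of how these questions could be answered once each student's scores sit in a single list-valued cell. The data frame is retyped from the slide, with `None` standing in for NA; the helper names `improved` and `missed_a_test` are illustrative and not from the talk.

```python
import pandas as pd

# Retyped from the slide; None stands in for the slide's NA.
nosql_df = pd.DataFrame({
    "Name": ["Joe", "Mary", "Allen"],
    "Ordered Scores": [[1, 14, 17], [11, 14, 9], [25, None, 9]],
})

def improved(scores):
    """True if the last recorded score beats the first recorded score."""
    seen = [s for s in scores if s is not None]
    return len(seen) > 1 and seen[-1] > seen[0]

def missed_a_test(scores):
    """True if any slot in the ordered list is empty."""
    return any(s is None for s in scores)

n_students = len(nosql_df)  # one row per student
who_improved = list(nosql_df.loc[nosql_df["Ordered Scores"].apply(improved), "Name"])
who_missed = list(nosql_df.loc[nosql_df["Ordered Scores"].apply(missed_a_test), "Name"])
```

Each question becomes a single apply() call over the list column, with no grouping step first.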

DON’T GET MAD
I’m not saying to kill tidy
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
18

DON’T GET MAD
I’m not saying to kill tidy
But I worry we don’t use certain methods
more often because it’s not as easy as it
could be.
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
19

BEFORE I GOT INTO THE ’NOSQL’ MINDSET I
SIGHED WHEN ASKED QUESTIONS LIKE…
• App analytics: What usage patterns do we see in our
long-term app users? How do those patterns evolve over
time at the individual level?
20

BEFORE I GOT INTO THE ’NOSQL’ MINDSET I
SIGHED WHEN ASKED QUESTIONS LIKE
THESE
• Health research: Can we predict early on in an
experiment what’s likely to happen? Do our experiments
need to be as long as they are?
21

BEFORE I GOT INTO THE ’NOSQL’
MINDSET I SIGHED THINKING ABOUT…
• Health research: Can we predict early on in an
experiment what’s likely to happen? Do our experiments
need to be as long as they are?
• Consumer research: Do people like things because they
like them or because of the ordering they saw them in?
22

I ALSO TENDED NOT TO ASK NO-SQL
QUESTIONS TOO OFTEN
• Status quo bias: humans tend to take whatever default is
presented. That happens in data analysis too.
23

QUESTIONS TOO OFTEN
• Endowment effect: humans tend to want what they already
have and think it’s more valuable than what’s offered for a trade.
24

QUESTIONS TOO OFTEN
• Endowment effect: humans tend to want what they already
have and think it’s more valuable than what’s offered for a trade.
• Especially deep finding: humans are lazy
25

IT’S TRUE. YOU CAN ANSWER THESE
QUESTIONS WITH THE TIDY DATA FRAMES I
JUST SHOWED YOU.
You can always ’reconstruct’ these trajectories of
what happened by making a data frame per user
>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1
Option 1:
>>> no_sql_df = df.groupby('Name').apply(lambda df: list(df.sort_values(by='Day')['Score']))
>>> no_sql_df
Name
Allen        [25, 9]
Joe      [1, 14, 17]
Mary     [11, 14, 9]
26

Option 2:
>>> new_list = []
>>> for name, group in df.groupby('Name'):
...     # list() is needed because zip() returns a lazy iterator in Python 3
...     new_list.append({name: list(zip(group['Day'], group['Score']))})
...
>>> new_list
[{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]
27
>>> df
Name Day Score
0 Allen 1 25
1 Joe 3 17
2 Joe 2 14
3 Mary 2 14
4 Mary 1 11
5 Allen 3 9
6 Mary 3 9
7 Joe 1 1

Option 3:
>>> def process(new_df):
...     # One slot per day 1-3; None marks a day with no score
...     return [new_df[new_df['Day'] == i]['Score'].values[0]
...             if i in list(new_df['Day']) else None
...             for i in range(1, 4)]
...
>>> df.groupby(['Name']).apply(process)
Name
Allen    [25, None, 9]
Joe        [1, 14, 17]
Mary       [11, 14, 9]
28
>>> df
Name Day Score
0 Allen 1 25
1 Joe 3 17
2 Joe 2 14
3 Mary 2 14
4 Mary 1 11
5 Allen 3 9
6 Mary 3 9
7 Joe 1 1
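One possible alternative to the three options above, sketched here as an assumption rather than something from the talk: `DataFrame.pivot` lines the days up as columns, so missing days show up as NaN without any per-group bookkeeping. The variable names `wide` and `trajectories` are illustrative.

```python
import pandas as pd

# The tidy frame from the slides.
df = pd.DataFrame({
    "Name": ["Allen", "Joe", "Joe", "Mary", "Mary", "Allen", "Mary", "Joe"],
    "Day": [1, 3, 2, 2, 1, 3, 3, 1],
    "Score": [25, 17, 14, 14, 11, 9, 9, 1],
})

# One row per person, one column per day; absent (Name, Day) pairs become NaN.
wide = df.pivot(index="Name", columns="Day", values="Score")

# Collapse each row back into an ordered list, using None for missing days.
trajectories = {
    name: [None if pd.isna(v) else int(v) for v in row]
    for name, row in wide.iterrows()
}
```

The trade-off is that the pivot assumes at most one score per person per day; duplicate pairs would raise an error.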

LET’S BE HONEST…NO ONE WANTS THAT
TO BE A FIRST STEP TO EVERY QUERY...AND
INCREASINGLY WE’LL BE REQUIRED TO
MAKE THESE SORTS OF QUERIES
• Google ads (well… maybe less so in Europe)
29

• Wearable sensors
30

• The unit of an observation should be the actor, not the particular
action observed at a particular time.
31

• The unit of an observation should be the actor, not the particular
action observed at a particular time.
• Maybe we should rethink what we mean by ‘observations’
32

• High scalability
• Distributed computing
• Schema flexibility
• Semi-structured data
• No complex relationships
• Schema change all the time
• Patterns change all the time
• Same units of interest
repeating new things
33
We don’t look for No-SQL because we have No-SQL databases... We
have No-SQL databases because we have No-SQL data.

WHAT IS NO-SQL PYTHON?
Data that doesn’t seem like it fits in a data frame
• Arbitrarily nested data
• Ragged data
• Comparative time series
34

WHERE DO WE FIND NO-SQL DATA?
Here’s where I’ve found it…
• Physics lab
• Running data
• Health data
• Reddit
35

GETTING THE DATA INTO PYTHON
WHEN IT’S STRAIGHTFORWARD
• Scenario: you’re grabbing a bunch of NoSQL data from an
API or from a NoSQL db.
• We’ll stick with JSON since it’s a common format.
• Best case scenario: you’ll take everything however you can
get it. In this case stick with pandas. The json_normalize
function works great.
37

JSON_NORMALIZE WORKS PRETTY
WELL
38
{"samples": [
  { "name": "Jane Doe",
    "age": 42,
    "profession": "architect",
    "series": [
      {
        "day": 0,
        "measurement_value": 0.97
      },
      {
        "day": 1,
        "measurement_value": 1.55
      },
      {
        "day": 2
      }
    ]},
  { "name": "Bob Smith",
    "hobbies": ["tennis", "cooking"],
    "age": 37,
    "series":
      {
        "day": 0
      } }]}


41
>> import json
>> import pandas as pd
>> with open(json_file) as data_file:
>>     data = json.load(data_file)
>> normalized_data = pd.json_normalize(data['samples'])
Easy to process

42
Easy to process
>> print(normalized_data['series'][0][1])
{'day': 1, 'measurement_value': 1.55}
Basically, it just works

43
Easy to process
Easy to add columns
>> normalized_data['length'] = normalized_data['series'].apply(len)
>> print(normalized_data['series'][0][1])
{'day': 1, 'measurement_value': 1.55}
Basically, it just works
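A small sketch of the two shapes `json_normalize` can give you, assuming a recent pandas where it is exposed as `pd.json_normalize`. The records below are hypothetical stand-ins shaped like the slide's example (the names and values are made up), and both use a list for `series`.

```python
import pandas as pd

# Hypothetical records shaped like the slide's example; the values are made up.
samples = [
    {"name": "A", "age": 42,
     "series": [{"day": 0, "value": 0.5}, {"day": 1, "value": 0.7}]},
    {"name": "B", "age": 37,
     "series": [{"day": 0, "value": 0.9}]},
]

# Nested view: one row per person, 'series' stays a list of dicts in one cell.
nested = pd.json_normalize(samples)
nested["length"] = nested["series"].apply(len)

# Tidy view: one row per measurement, with person-level fields repeated.
tidy = pd.json_normalize(samples, record_path="series", meta=["name", "age"])
```

The nested view keeps the actor as the unit of observation; the tidy view is there when a rectangular table is what the next step needs.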

USING SOME PROGRAMMER STUFF ALSO HELPS
44
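class dfList(list):
    def __init__(self, originalValue):
        # Rebinding self has no effect, so initialise the underlying list instead
        if isinstance(originalValue, list):
            list.__init__(self, originalValue)
        else:
            list.__init__(self, list(originalValue))
    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            # ITEM_TO_GET is a module-level key you set for the field of interest
            return result[ITEM_TO_GET]
        except Exception:
            return result
    def __iter__(self):
        for i in range(len(self)):
            yield self.__getitem__(i)
    def __call__(self):
        # Mean of the selected field across all elements
        return sum(self) / list.__len__(self)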
• Subclass an iterable to shorten your
apply() calls
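A hedged variation on the idea, to show what the subclass buys you. `FieldList` and its `key` argument are hypothetical: the key is passed in explicitly here instead of being read from a constant, and the sample values are made up.

```python
class FieldList(list):
    """A list of dicts that hands back one chosen field when indexed or iterated.

    Hypothetical variation on the slide's dfList: the key is passed in
    explicitly instead of being read from a global constant.
    """

    def __init__(self, values, key):
        super().__init__(values)
        self.key = key

    def __getitem__(self, item):
        result = super().__getitem__(item)
        try:
            return result[self.key]
        except (KeyError, TypeError, IndexError):
            return result  # fall back to the raw element

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

    def __call__(self):
        """Mean of the chosen field across all elements."""
        return sum(self) / len(self)


series = FieldList(
    [{"day": 0, "value": 1.0}, {"day": 1, "value": 2.0}, {"day": 2, "value": 6.0}],
    key="value",
)
mean_value = series()  # in a data frame: df['series'].apply(lambda s: s())
```

Because indexing and iteration already return the field of interest, the lambdas passed to apply() stay short.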

45
• Subclass an iterable to shorten your
apply() calls
• In particular, you need to override at
least __getitem__ and __iter__

46
• Subclass an iterable to shorten your
apply() calls
• You should probably override __init__
as well for the case of inconsistent
format

47
• Subclass an iterable to shorten your
apply() calls
• You should probably override __init__
as well for the case of inconsistent
format
• Then __call__ can be a catch-all
adjustable function... best to load it up
with a call to a class function, which
you can adjust at will anytime.

CUSTOM CLASSES PAIR NICELY WITH
CLASS METHODS
48
class Test:
    def __init__(self, name):
        self.name1 = name
    def print_class_instance(instance):
        print(instance.name1)
    def print_self(self):
        # Looked up on the class at call time, so it can be swapped later
        self.__class__.print_class_instance(self)
>>> test1 = Test('test1')
>>> test1.print_self()
test1
>>> def new_printing(instance):
...     print("Now I'm printing a constant string")
...
>>> Test.print_class_instance = new_printing
>>> test1.print_self()
Now I'm printing a constant string
• Design flexible classes that often
reference class methods rather
than instance methods
• Then as you are processing data,
you can quickly swap out methods
to call different field names in the
event of highly nested JSON
• Data processing is faster and no
mental gymnastics or annoying
parse efforts required
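The same swap applied to nested JSON might look like the sketch below. The `Record` class, the `extract` function, and the field names are all hypothetical; the point is only that replacing one function on the class changes how every existing instance reads its data.

```python
class Record:
    """Wraps one parsed JSON record; field extraction lives on the class."""

    def __init__(self, raw):
        self.raw = raw

    def extract(instance):
        # Default extractor: look for a top-level 'value' field.
        return instance.raw["value"]

    def value(self):
        # Look the extractor up on the class at call time, so it can be swapped.
        return self.__class__.extract(self)


flat = Record({"value": 3})
nested = Record({"payload": {"reading": {"value": 7}}})

first = flat.value()

def extract_nested(instance):
    # Replacement extractor for records that bury the value two levels down.
    return instance.raw["payload"]["reading"]["value"]

Record.extract = extract_nested  # every existing instance now uses the new rule
second = nested.value()
```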

GETTING NOSQL DATA: COMMONLY-
ENCOUNTERED PROBLEMS
• CSVs with arrays
• Highly-nested JSON
• Unknown or unreliably formatted API results
49

SOMETIMES YOU GET WEIRD CSV FILES…
• Sometimes your problem is as simple as getting a csv file with nested data
50

• This is pretty straightforward to deal with…use regex and common Python
string operations to clean up the data
51

• This is pretty straightforward to deal with…use regex and common Python
string operations to clean up the data
• Apply() is your best friend
52

• This is pretty straightforward to deal with…use regex and common Python string
operations to clean up the data
• Apply() is your best friend
• Common problem: spaces between “,” and a column name or column value. Use the
skipinitialspace parameter to avoid this problem:
df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)
53
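A quick illustration of what `skipinitialspace` changes, using a made-up three-line CSV held in memory rather than a file.

```python
import io
import pandas as pd

# Made-up CSV with a space after every comma.
raw = "name, favorite, age\njoe, elvis, 28\nmary, adele, 36\n"

# Without the flag, the leading spaces end up inside column names and values.
naive = pd.read_csv(io.StringIO(raw))

# With the flag, whitespace right after each delimiter is dropped.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
```

In the naive frame the column is called `' favorite'`, so `naive['favorite']` raises a KeyError, which is usually how this problem first shows up.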

name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"
This isn’t even that weird

name,favorites,age
>> df = pd.read_csv(file_name, sep =",")
Downright straightforward

56
name,favorites,age
Hmmm….
>> print(df['favorites'][0][1])
m

57
name,favorites,age
Hmmm….
Regex to the rescue… Python’s exceptionally easy string parsing is a huge asset for No-SQL parsing
>> df['favorites'] = df['favorites'].apply(lambda s: [item.strip() for item in s[1:-1].split(',')])
>> print(df['favorites'][1][1])
adele

WHAT ABOUT THIS ONE?
58
name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
Downright straightforward?
Actually this fails miserably
>> print(df['favorites'])
>> joe [madonna elvis u2]
mary [lady gaga adele] 36
Name: name, dtype: object
We need more regex…this time before applying read_csv()....

59
name,favorites,age
Missing quotes around arrays:
Basically, put in the quotation marks to help out read_csv()

60
name,favorites,age
import re
import pandas as pd

# Escaped brackets: match a literal "[", anything, then a literal "]"
pattern = r"(\[.*\])"
with open(file_name) as f, open(write_file, 'w') as write_f:
    for line in f:
        new_line = line
        # Wrap every bracketed array in double quotes so read_csv() keeps it as one field
        for m in re.finditer(pattern, line):
            replacement = '"' + m.group(1) + '"'
            new_line = new_line.replace(m.group(1), replacement)
        write_f.write(new_line)
new_df = pd.read_csv(write_file)

61
name,favorites,age
With multiple arrays per row, you’re gonna need to accommodate the greedy nature of regex
pattern = r"(\[.*?\])"
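To make the greedy versus non-greedy difference concrete, here is a short sketch on a made-up line with two arrays. It assumes the intended patterns escape the square brackets so they match literal `[` and `]`.

```python
import re

# Made-up row with two unquoted arrays.
line = "joe,[madonna,elvis,u2],[nyc,boston],28"

# Greedy: .* runs to the LAST "]" on the line, swallowing both arrays at once.
greedy = re.findall(r"(\[.*\])", line)

# Non-greedy: .*? stops at the FIRST "]", so each array is matched on its own.
lazy = re.findall(r"(\[.*?\])", line)

# Quote each array separately so a CSV reader treats it as one field.
quoted = re.sub(r"(\[.*?\])", r'"\1"', line)
```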

62
THAT WAS A LOT OF TEXT…
ALMOST DONE

SOMETIMES YOU GET JSON AND YOU KNOW
THE STRUCTURE, YOU JUST DON’T LIKE IT
• Use json_normalize() and then shed columns you don’t want. You’ve seen that
today already (slides 32-38).
• Use some magic: the sh module with jq to simplify your life… you can pick out the
fields you want with jq either on the command line or with sh
• jq has a straightforward, easy to learn syntax: . = value, [] = array operation, etc…
63
import json
import sh
cat = sh.cat
jq = sh.jq
rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
out = jq(rule, cat(_in=json_data)).stdout
uni_out = out.decode('utf-8')  # .stdout is bytes; decode before parsing
json.loads(uni_out)

AND SOMETIMES YOU HAVE NO IDEA
WHAT’S IN AN ENORMOUS JSON FILE
• Inconsistent or undocumented API
• Legacy Mongo database
• Someone handed you some gnarly JSON because they
couldn’t parse it
64

YOU’RE A PROGRAMMER…USE ITERATORS
• The ijson module is an iterator JSON parser…you can
deal with structure one bit at a time
• This also gives you a great opportunity to make data
parsing decisions as you go
• This isn’t fast, but it’s also not fast to shoot from the hip
when you’re talking about gnarly JSON
65

66
import ijson

with open(file_name, 'rb') as file:
    # Stream one element of the top-level "samples" array at a time
    results = ijson.items(file, "samples.item")
    for record in results:
        for k in record.keys():
            # Only nested containers need a structure check
            if isinstance(record[k], (dict, list)):
                recursive_check(record[k])
        process(record)

67
from collections import defaultdict
from importlib import import_module

total_dict = defaultdict(lambda: False)
def recursive_check(d):
    if isinstance(d, dict):
        key_signature = tuple(sorted(d.keys()))
        if not total_dict[key_signature]:
            # First time this set of keys is seen: ask which class models it
            class_name, module_name = input("Input the new classname with a space and then the file name defining the class ").split()
            mod = import_module(module_name)
            cls = getattr(mod, class_name)
            total_dict[key_signature] = cls
        for k in d.keys():
            # Recurse on the value, not the key
            new_class = recursive_check(d[k])
            if new_class:
                d[k] = new_class(**d[k])
        return total_dict[key_signature]
    elif isinstance(d, list):
        for i in range(len(d)):
            new_class = recursive_check(d[i])
            if new_class:
                d[i] = new_class(**d[i])
    else:
        return False
• Basically,you can build custom classes
or generate appropriate named tuples
as you go.
• This lets you know what you have and
lets you build data structures to
accommodate what you have.
• Storing these objects in a class rather
than simple dictionary again gives you
the option to customize .__call__()
to your needs

68
• Basically, you can build custom classes or
generate appropriate Named tuples as you go.
This lets you know what you have and lets you
build data structures to accommodate what you
have.
• Again remember that class methods can easily
be adjusted dynamically,so it’s good to code
classes with instances that reference class
methods.

CLUSTERING TIME SERIES
• Reports of clustering and classifying time
series are surprisingly rare
• Methods are computationally demanding,
O(N²)… but we’re getting there
• Relatedly, ‘classification’ can also be used for
series-related predictions
• Can use many commonly applied clustering
algorithms once you get the distance metric
70
http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf

WHEN DO PEOPLE GO RUNNING?
72
Actually, I made these plots with R…

NANO-SCALE PHYSICS
73
Meisner et al, J. Am. Chem. Soc. 2012, 134, 20440−20445
• You can build an electrical circuit which has
a single molecule as its narrowest part
• It turns out it’s quite easy to distinguish
different molecules depending on their
trajectory as you pull on them
• Particularly their summed behavior looks
quite different
• Suggests that we could cluster and identify
individual measurements with reasonable
certainty

74
• Several months of pulling the top 25 threads off Reddit’s
front page shows significantly different trends for
different subreddits.
REDDIT

75
REDDIT

76
• Some kinds of posts don’t last long (r/TwoX and r/videos)
• r/personalfinance shows a remarkable ability to have a
second peak/second life on the front page
• r/videos do great but burn out quickly
REDDIT

QUICK: HOW IT WORKS II.
78
• O(N²) in theory
• Various lower bounding techniques
significantly reduce processing time
• Dynamic programming problem
http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
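The linked paper is about dynamic time warping (DTW), so assuming that is the method these bullets describe, a bare-bones version of the dynamic program looks like the sketch below. It is the plain quadratic form with no lower-bounding or windowing, and the two toy series are made up.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW with squared-error cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three neighbouring alignments.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def euclidean_distance(a, b):
    """Point-by-point distance; assumes both series have the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ts1 = [0, 0, 1, 2, 1, 0, 0]
ts2 = [0, 1, 2, 1, 0, 0, 0]  # same bump as ts1, shifted one step earlier

dtw = dtw_distance(ts1, ts2)
euclid = euclidean_distance(ts1, ts2)
```

Here the warped distance is zero because the two series have the same shape, while the point-by-point distance penalizes the one-step shift.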

WHY THE FANCY METHOD?
79
Euclidean distance matches ts3
to ts1, despite our intuition that
ts1 and ts2 are more alike.
http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb

BIKE-SHARING STANDS
80
http://ofdataandscience.blogspot.nl/2013/03/capital-bikeshare-time-series-clustering.html?m=1

FUTURE RESEARCH POSSIBILITIES
81
http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf

WHY?
Time series classification and related metrics can be one more thing to
know… or even several more things to know
82
Name   Ordered Scores  Score Trajectory Type  Number of Tests  Predicted Score For Next Test
Joe    [1, 14, 17]     good                   3                19
Mary   [11, 14, 9]     meh                    3                11
Allen  [25, NA, 9]     underachiever          2                35
• Score Trajectory Type: info from classification
• Predicted Score For Next Test: info from prediction
• Number of Tests: info from easy apply() calls

THE SHORT VERSION
• Pandas is already well-adapted to the No-SQL world
83

THE SHORT VERSION
• Make your data format work for you
84

THE SHORT VERSION
• Comparative time series go hand-in-hand with the increasing
availability of No-SQL data. Everything is a time series if you
look hard enough.
85

THE SHORT VERSION
• Comparative time series go hand-in-hand with the increasing
availability of No-SQL data. Everything is a time series if you
look hard enough.
• Non-time series collections are also informative. This was just
one example of what you can do.
86
