Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
NoSQL to Augment Hadoop Big Data Platforms
1. Jeffrey Breen
Director, Think Big Academy
October 2014
NoSQL to Augment Hadoop Big Data Platforms
2. CONFIDENTIAL | 2
Outline
• Introduction
• Hadoop and NoSQL: What? Where? Why? When?
• Document-Oriented NoSQL and Hadoop
• Example: Add Statefulness
• Example: Analytics Store
• Example: Secondary Index
− Caution: contains code
• MongoDB Connector for Hadoop
3. Leading Provider of Big Data Solutions & Support
Delivering Business Value Through Big Data
• Exclusive Focus on Big Data Tools, Technologies, and Techniques
• Onshore Team-Based Engineering and Data Science Methodology
• Prebuilt, Proven Components to Accelerate Delivery & Lower Risk
4. Agile Methodology
We Accelerate Your Time to Value
• Experiment-Driven Short Sprints with Quick Release Cycles
• Breaking Down Business and IT Barriers
• Discrete Projects with Beginning and End
• Early Releases to Validate ROI and Ensure Long Term Success
Diagram labels: Data Engineers, Data Scientists, Business Goals, Innovation and Value
5. Jeffrey Breen
Director, Think Big Academy
Principal Consultant and Hands-on Architect
IT guy, Data guy, Open Source guy
Pilot and Airplane Geek
Twitter: @JeffreyBreen
jeffrey.breen@thinkbiganalytics.com
6. Hadoop and NoSQL
• Not “either-or”
− When together? Where? For what?
• Hadoop
− Not a database
− Low cost storage with fault tolerance
− Batch-oriented analytics (MapReduce, Hive, Pig)
− Not good for random access and/or updates
• NoSQL
− Real databases with CRUD
− Optimized for fast, random access
− Many shapes and sizes (key-val, tabular, graph, document oriented)
8. Document-Oriented NoSQL with Hadoop
• Advantages
− Simple but flexible data model
− Field-level indexing for fast querying
− Easy and open APIs and data exchange formats
• Examples
1. Add Statefulness. Preserve state between jobs and other stateless operations.
2. Analytics Store. Provide high performance destination for calculations and metrics.
3. Secondary Indexing. Add low-latency querying and access for high-latency data stores like HDFS.
9. Example: Add Statefulness
Overview
- Sometimes you just need a fast and safe place to store data between jobs, applications, iterations
Scenarios
- Data extraction jobs
- Ingestion processing status
- Broadcasting “last best” parameters in machine learning, genetic algorithms, and other model fitting
{
  "process": "db-extractor",
  "system": "database1",
  "tables": {
    "table1": { "columns": ["ts"],
                "values": ["2014-03-25 03:15:23"] },
    "table2": { "columns": ["client_id"],
                "values": ["43110221"] }
  }
}
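A minimal Python sketch of how an extraction job might use a state document shaped like the one above. It assumes that each table's "columns" and "values" record the high-water mark reached by the previous run, which the slide does not state explicitly. The function names are hypothetical, and the state lives in a plain dict so the example runs on its own; in practice it would be read from and written back to the document store between runs.

```python
import copy

# State document shaped like the slide's example (one entry per source table).
STATE = {
    "process": "db-extractor",
    "system": "database1",
    "tables": {
        "table1": {"columns": ["ts"], "values": ["2014-03-25 03:15:23"]},
        "table2": {"columns": ["client_id"], "values": ["43110221"]},
    },
}


def incremental_queries(state):
    """Build one parameterized query per table that fetches only rows
    beyond the checkpoint saved by the previous run (assumed meaning)."""
    queries = {}
    for table, checkpoint in state["tables"].items():
        predicates = " AND ".join(f"{col} > ?" for col in checkpoint["columns"])
        sql = f"SELECT * FROM {table} WHERE {predicates}"
        queries[table] = (sql, tuple(checkpoint["values"]))
    return queries


def advance_checkpoint(state, table, new_values):
    """Return a copy of the state with one table's checkpoint moved forward."""
    updated = copy.deepcopy(state)
    updated["tables"][table]["values"] = list(new_values)
    return updated
```

Calling `incremental_queries(STATE)` yields a query and its bound values for each table; after a successful run, `advance_checkpoint` produces the document to save for the next one. Returning a copy rather than mutating in place keeps the old checkpoint available if the save fails.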
10. Example: Analytics Store
• Great place to store aggregates and other calculated metrics
• Can be populated from batch or streaming analytics
• Great for serving live dashboards and reporting
{
  "metric": "session-length",
  "visitor": "{2CC8C651-A9F4-4CB4-8639-7688FCD21D59}",
  "visit-start": "2014-03-25 03:15:23",
  "data": {
    "value": 245.3,
    "units": "seconds" }
}
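As a rough illustration of how a batch or streaming job might produce such a document, the Python sketch below computes a session length from one visitor's page-view timestamps. The function name and the definition of session length (time between first and last page view) are assumptions for this example, not something the slides specify.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def session_length_metric(visitor, page_view_times):
    """Summarize one visit as a metric document: seconds between the
    first and last page view (an assumed definition of session length)."""
    if not page_view_times:
        raise ValueError("at least one page view is required")
    stamps = sorted(datetime.strptime(t, TS_FORMAT) for t in page_view_times)
    start, end = stamps[0], stamps[-1]
    return {
        "metric": "session-length",
        "visitor": visitor,
        "visit-start": start.strftime(TS_FORMAT),
        "data": {"value": (end - start).total_seconds(), "units": "seconds"},
    }
```

For example, passing the two page-view times that appear later in the deck for the Taunton visitor ("2012-03-10 18:03:05" and "2012-03-10 18:46:58"), and treating them as a single visit purely for illustration, gives a value of 2633.0 seconds. The resulting dict can be stored as-is in a document database and read directly by a dashboard.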
11. Example: Secondary Indexing
• HDFS is optimized for scans; seeks are very expensive
• As in relational databases, secondary indexes can be created on specific elements
• Hive even has indexing built in, but keeps the results on HDFS (still not optimized for seeks)
• Solution: Use a separate NoSQL database for secondary indexes
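The idea behind a byte-offset secondary index can be sketched on a single local file. The Python below is only an analogy under stated assumptions: it uses an in-memory byte stream and made-up tab-separated records rather than HDFS, and the function names are hypothetical. It scans the data once to record where each key's records start, then jumps straight to those offsets.

```python
import io
from collections import defaultdict


def build_offset_index(stream, key_field, sep=b"\t"):
    """Scan the data once, recording the starting byte offset of every
    record under the value found in column `key_field`."""
    index = defaultdict(list)
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        key = line.rstrip(b"\n").split(sep)[key_field]
        index[key.decode("utf-8")].append(offset)
    return dict(index)


def fetch_records(stream, offsets):
    """Jump straight to each recorded offset instead of rescanning."""
    records = []
    for offset in offsets:
        stream.seek(offset)
        records.append(stream.readline().rstrip(b"\n").decode("utf-8"))
    return records


# Toy records invented for this example: id, city, state.
data = io.BytesIO(b"1\ttaunton\tma\n2\tlos gatos\tca\n3\ttaunton\tma\n")
index = build_offset_index(data, key_field=1)
taunton_rows = fetch_records(data, index["taunton"])
```

The index itself is small and is looked up by key, which is the part that benefits from living in a store built for fast random access.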
12. Sample Clickstream Data
• Sample Omniture clickstream files are available from Hortonworks
− 420,000+ page views over 15 days
− https://s3.amazonaws.com/hw-sandbox/tutorial8/RefineDemoData.zip
• Example records combine web page and visitor information, including geocoding:
1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} U en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 0 ABC 0 120 ABC 0
1331434006 2012-03-10 18:46:46 2850864012585216412 6917530841728651042 FAS-2.8-AS3 N 0 24.6.122.234 1 0 10 http://www.acme.com/SH55126545/VD55177927 {52B4FFFE-606A-1C2B-77E7-F62057879CC8} U en-us 574 0 0 U U Y 0 0 304 comcast.net 10/2/2012 18:17:59 6 480 45 2 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1 71 0 37 2 0 los gatos usa 807 ca 0 0 0 0 0 KGO 0 120 KGO
13. Time-Partitioned Data
• Time is a very common dimension on which to organize data
• Great for processing incoming data and for filtering any time-based queries…
• …but can complicate other access patterns
Hive partitions correspond to directories on HDFS:
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=1/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=2/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=3/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=4/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=5/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=6/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=7/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=8/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=9/000000_0
/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0
[…]
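To make the trade-off concrete, here is a small Python sketch of partition pruning over the directory layout listed above. The helper names are hypothetical and this is not how Hive implements pruning internally; it only shows that a date filter maps to a handful of directories, while a filter on something else (a city, say) gives no such shortcut and has to read every partition.

```python
from datetime import date, timedelta

TABLE_ROOT = "/apps/hive/warehouse/omniture_daily"


def partition_dir(day):
    """Directory holding one day's data, following the year=/month=/day= layout."""
    return f"{TABLE_ROOT}/year={day.year}/month={day.month}/day={day.day}"


def partitions_for_range(first, last):
    """Directories a date-range query must read; every other day is skipped."""
    days = (last - first).days + 1
    return [partition_dir(first + timedelta(days=n)) for n in range(days)]
```

A query restricted to March 8 through March 10, 2012 touches only the three directories returned by `partitions_for_range(date(2012, 3, 8), date(2012, 3, 10))`.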
14. Welcome to the Long Tail
Top 10 ≃ Bottom 2000
Distribution of geographic locations detected in clickstream data:
> sum(subset(df, rank <= 10)$count)
[1] 36986
> sum(subset(df, rank > max(df$rank) - 2000)$count)
[1] 33971
In this sample clickstream data set, the top 10 cities account for more traffic than the bottom 2,000 combined
Optimizations are usually designed for the most common cases
- “Biggest bang for the buck” due to size, frequency, etc.
- What are the chances that the optimizations you pick to handle the most common cases work well for the long tail?
- What if a new business opportunity depends on the long tail?
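The same head-versus-tail comparison can be sketched in Python. The function name is hypothetical and the counts in the usage note are made up for illustration; they are not the clickstream figures.

```python
def head_vs_tail(counts, head_n, tail_n):
    """Return (traffic in the head_n busiest locations,
    traffic in the tail_n quietest locations)."""
    ranked = sorted(counts, reverse=True)
    head_total = sum(ranked[:head_n])
    tail_total = sum(ranked[len(ranked) - tail_n:]) if tail_n > 0 else 0
    return head_total, tail_total
```

With toy per-city counts of `[50, 30, 5, 4, 3, 2, 2, 1, 1, 1, 1]`, `head_vs_tail(counts, 2, 9)` compares the two busiest cities against the remaining nine.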
15. Secondary Indexing in Hive
• Hive has built-in facilities to index data
create index location on table omniture_daily(city, state, country)
as 'COMPACT' with deferred rebuild;
alter index location on omniture_daily rebuild;
• Index stores pointers to locations of each found record (path, file, and byte offset)
• However, resulting index is partitioned the same way as the underlying table
16. Exporting Hive Data as JSON
• Hive can easily read/write JSON data via a SerDe:
− https://github.com/sheetaldolas/Hive-JSON-Serde/tree/master
add jar json-serde-1.1.9.2-Hive13-jar-with-dependencies.jar;
create table json_export (
city string,
country string,
state string,
bucketname string,
offsets array<bigint>,
year int,
month int,
day int
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
insert into table json_export select * from
default__omniture_daily_location__;
• ROW FORMAT SERDE: column parsing determined by Hive SerDe classes
• STORED AS: Hadoop’s InputFormat and OutputFormat
17. Sample Index Entry
Hive indices contain physical location of original data, including byte offsets:
{
  "city": "taunton",
  "state": "ma",
  "country": "usa",
  "bucketname": "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0",
  "offsets": [ 4748045, 3522685 ],
  "year": 2012,
  "day": 10,
  "month": 3
}
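A short Python sketch of how a client might turn an index entry like this into concrete read locations. The function name is hypothetical; it assumes the entry is valid JSON with the "bucketname" and "offsets" fields shown in the sample.

```python
import json
from urllib.parse import urlparse


def record_locations(index_entry_json):
    """Expand one index entry into (hdfs_path, byte_offset) pairs,
    sorted by offset so reads move forward through the file."""
    entry = json.loads(index_entry_json)
    # Keep only the path portion, dropping the hdfs://host:port prefix.
    path = urlparse(entry["bucketname"]).path
    return [(path, offset) for offset in sorted(entry["offsets"])]
```

For the Taunton entry this yields two pairs, both pointing into the `day=10` file, at offsets 3522685 and 4748045.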
18. Exporting Index Data to Mongo
• Since our Hive index data is now stored on HDFS in JSON format, it’s very easy to load into Mongo directly.
• Don’t do this in production, but that’s what makes simple examples so much fun:
$ hadoop fs -text /apps/hive/warehouse/json_export/000000_0 |
mongoimport --host localhost --db clickstream --collection locidx
connected to: localhost
Sat Sep 27 10:30:22.325 100 16/second
Sat Sep 27 10:30:24.448 check 9 12262
Sat Sep 27 10:30:24.449 imported 12262 objects
19. Querying the Index in Mongo
$ mongo localhost
MongoDB shell version: 2.4.6
connecting to: localhost
> use clickstream;
switched to db clickstream
> db.locidx.find( {'state':'ma', 'city':'taunton'} );
{ "_id" : ObjectId("5426f42e6a6b0b1939528f80"),
"bucketname" : "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0",
"offsets" : [ 4748045, 3522685 ], "month" : 3, "state" : "ma",
"year" : 2012, "day" : 10, "country" : "usa", "city" : "taunton" }
• "bucketname": specific file on HDFS containing the records of interest
• "offsets": byte offsets within that file containing the records of interest
20. Using the index data to retrieve the original data
$ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=3522685&length=615'; echo
1331431385 2012-03-10 18:03:05 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 20 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC
$ curl -L 'http://sandbox.hortonworks.com:50070/webhdfs/v1/apps/hive/warehouse/omniture_daily/year=2012/month=3/day=10/000000_0?op=OPEN&offset=4748045&length=615'; echo
1331434018 2012-03-10 18:46:58 2850813067829261564 4611687161967479390 FAS-2.8-AS3 N 0 24.63.166.252 1 0 10 http://www.acme.com/SH5568487/VD55169229 {2CC8C651-A9F4-4CB4-8639-7688FCD21D59} en-US 313 598 1259 Y Y Y 1 2 304 comcast.net 10/2/2012 20:50:37 6 300 45 36 Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; WOW64; GTB7.3; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.30618; .NET CLR 3.5.30729; .NET4.0C) 71 0 2 53 0 taunton usa 521 ma 0 0 0 0 ABC 0 120 ABC
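The WebHDFS requests above follow a fixed pattern, so they can be generated from the index data. This Python sketch only builds the URL and performs no network call; the function name is hypothetical, and the host, port, and the fixed length of 615 bytes are taken from the curl examples.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(host, hdfs_path, offset, length, port=50070):
    """Build a WebHDFS OPEN URL that reads `length` bytes starting at `offset`."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    # Keep '/' and '=' literal so partition directories stay readable.
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path, safe='/=')}?{query}"
```

Calling it with the sandbox host, the `day=10` file path, offset 3522685 and length 615 reproduces the first curl URL. A fixed length works here because the sample records are about that size; a more general client would likely read until the next newline instead.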
21. So what’s the right way to do it?
Check out the MongoDB Connector for Hadoop
• Available at https://github.com/mongodb/mongo-hadoop
• Contains a “storage engine” to connect Hive directly to MongoDB for live querying
• Provides a Hive SerDe for direct access to static BSON files (i.e., backup files)
• Allows Hadoop Streaming jobs (python, perl, R, etc.) access to Mongo files
• And more
22. Work with the
Leading Innovator in Big Data
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Editor's Notes
Think Big is a leading provider of big data solutions and analytic applications.
We achieve this by working in lock-step with business leaders to align their goals with big data strategy and planning services, which become the roadmap for the data science and data engineering services we provide to implement big data projects.