Big Data 
Overview – Part 1 
Wm. Barrett Simms 
barrett@wbsimms.com 
@wbsimms
Opening remarks 
• Sponsors 
• Pluralsight 
• Free-month gift card giveaway. Enter your name in the pot! 
• DevExpress 
• $250 in JustCode developer tools. 
• O’Reilly 
• Book giveaway. Enter your name in the pot! 
• Boston Code Camp 22 (November 22nd) 
• http://www.bostoncodecamp.com/ 
• Thanks to 3thought for the space
About Me 
• Software Developer 
• Agile Team Member 
• Team Lead 
• Agile Advocate 
• SDLC Implementer 
SDLC 
[Diagram slide]
Big Data 
“Big data is an all-encompassing term for any collection of data sets so 
large and complex that it becomes difficult to process using traditional 
data processing applications.” 
- Wikipedia
The 3 Vs 
• Volume 
• From a few gigabytes up to petabytes 
• Velocity 
• Arrives quickly 
• Variety 
• Multiple Sources
Volume 
• Traditional SQL architectures don’t scale to very large data sets 
• Maybe this isn’t so true 
…but the MPP (massively parallel processing) systems are expensive
An example problem (Volume) 
• You own a chain of stores 
• … with 25,000 stores and 100,000 POS systems 
• Need information on inventory changes 
• By region 
• By store
Velocity 
• Traditional solutions don’t handle fast inbound data 
• Maybe this isn’t so true 
…but you lose data.
Another example (Velocity) 
• You host a website 
• … on 10,000 servers 
• Monitor logs for errors
Variety 
• Most traditional solutions don’t handle a variety of data types well 
• Maybe this isn’t so true 
…but you need to write a custom importer for every type.
A final example (Variety) 
• You own a business 
• … with sales and marketing teams 
• … in different regions around the world 
• Correlate sales numbers against marketing expenses
The First Problem: Computing Power 
[Diagram: each inbound request spawns its own First → Second → Third chain of processes on a single machine] 
Limited by cores 
(Scaling up)
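A minimal Python sketch, not from the deck, of why scaling up hits a ceiling: however many requests arrive, one machine can only run as many workers in parallel as it has cores. The handle_request function is a hypothetical stand-in for the per-request work.

```python
# Illustrative only: single-machine parallelism is bounded by core count.
from multiprocessing import Pool, cpu_count

def handle_request(request_id):
    # Stand-in for the "First -> Second -> Third" steps each request needs.
    return f"request {request_id} done"

if __name__ == "__main__":
    # The pool can never be wider than the cores on this one server,
    # so extra requests just queue behind the same fixed set of workers.
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(handle_request, range(100_000))
    print(len(results), "requests handled on", cpu_count(), "cores")
```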
Solution: Scale out (not up!) 
[Diagram: a Coordinator distributing work across Server 1, Server 2, Server 3, and Server 4]
Coordination 
[Diagram: a Job Coordinator dispatching work to multiple Runners]
MapReduce 
• A programming model and an associated implementation for 
processing and generating large data sets with a parallel, distributed 
algorithm on a cluster. – Wikipedia 
WHAT?
Map and Reduce 
• Map 
• Process data returning key value pairs 
• Reduce 
• Aggregate/Filter key value pairs into result 
[Diagram: Data flows into Map tasks; the mapped key/value pairs flow into Reduce, which produces the Result]
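To make the diagram concrete, here is a toy, single-process Python sketch of the map → group-by-key → reduce flow. The function names and sample records are illustrative; this shows the idea only, not Hadoop's actual API.

```python
# Toy illustration of map/reduce in one process (not Hadoop itself).
from collections import defaultdict

def map_fn(record):
    # Map: turn one input record into (key, value) pairs.
    store_id, sales = record
    yield store_id, sales

def reduce_fn(key, values):
    # Reduce: aggregate all values that share a key into one result.
    return key, max(values)

def map_reduce(records):
    grouped = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            grouped[key].append(value)      # shuffle: group values by key
    return [reduce_fn(k, v) for k, v in grouped.items()]  # reduce phase

if __name__ == "__main__":
    data = [(13, 1000), (43, 12000), (21, 21000), (13, 3000)]
    print(map_reduce(data))  # e.g. [(13, 3000), (43, 12000), (21, 21000)]
```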
Mapping 
• Easy example 
• Store Sales 
• Find the highest sales total per store in 2010 
Year | Month | Store Id | SalesTotal 
2010 |     1 |       13 |      1,000 
2010 |     3 |       43 |     12,000 
2010 |     3 |       21 |     21,000 
2010 |     4 |       13 |      3,000 
2010 |     2 |       56 |      4,000 
2010 |     6 |       32 |     12,000 
2010 |     7 |        1 |      4,000 
2010 |     2 |       23 |      2,000
Solution – Map 
1. Mapper feeds document rows to your program 
2. You return key value pairs 
StoreId | Sales 
     21 |  2,000 
     23 |  3,000 
      2 |  1,000 
     21 | 23,000
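As a sketch of what that mapper could look like with Hadoop Streaming (an assumption; the deck doesn't show code), reading tab-separated Year/Month/StoreId/SalesTotal rows from standard input and emitting StoreId/Sales pairs:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: emit StoreId -> SalesTotal for 2010 rows.
# Assumes tab-separated input columns: Year, Month, StoreId, SalesTotal.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) != 4 or fields[0] != "2010":
        continue  # skip headers, malformed rows, and other years
    _year, _month, store_id, sales = fields
    print(f"{store_id}\t{sales}")  # key/value pair, tab-separated
```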
Solution - Reduce 
• Data is merged 
• Merged into Key/Values: 
{21, [2,000, 23,000]} 
{23, [3,000]} 
{2, [1,000]} 
• You process each row
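A matching reducer sketch, under the same Hadoop Streaming assumption: the framework sorts mapper output by key, so each store's sales arrive as consecutive lines and the reducer only needs a running maximum per store.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming reducer: maximum sales per store.
import sys

current_store, best = None, 0

for line in sys.stdin:
    if not line.strip():
        continue
    store_id, sales = line.strip().split("\t")
    sales = int(sales.replace(",", ""))  # tolerate "23,000"-style values
    if store_id != current_store:
        if current_store is not None:
            print(f"{current_store}\t{best}")  # finished this store's group
        current_store, best = store_id, sales
    else:
        best = max(best, sales)

if current_store is not None:
    print(f"{current_store}\t{best}")  # flush the final store
```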
Data Access 
• Each process needs access to data 
[Diagram: typical vs. desired data-access architectures]
HDFS 
• Hadoop Distributed File System 
• Open-source implementation of the Google File System (GFS) 
Hard drives last about 1,000 days, so if you have 1,000 of them, you’ll lose roughly one per day.
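That failure claim is just expected-value arithmetic; a quick back-of-the-envelope check (illustrative numbers only):

```python
# Rough expected-failure arithmetic, not a real reliability model.
mean_lifetime_days = 1_000
drive_count = 1_000
print(drive_count / mean_lifetime_days)  # 1.0 -> roughly one failed drive per day
```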
The ecosystem 
• Hive 
• SQL-like query language 
• Define and enforce schema 
• Pig 
• Data-flow scripting language (Pig Latin) 
• Sqoop 
• SQL/Hadoop integration 
• Oozie 
• Scheduling 
• Mahout 
• Machine learning library 
• Storm 
• Real-time stream processing 
… and Many Others
Vendors 
• Hortonworks 
• Single-click install of the Hortonworks Sandbox 
• Cloudera 
• Downloadable VM 
• Syncfusion 
• Single-click install of Syncfusion Big Data 
• Amazon AWS 
• Elastic MapReduce 
• Microsoft Azure 
• HDInsight
Contact Me 
Barrett Simms 
barrett@wbsimms.com 
http://wbsimms.com 
Twitter: @wbsimms 
Phone: 781.405.4686

Editor's Notes

  • #2 Welcome!
  • #4 Focus on technical product delivery
  • #14 Each inbound request spawns three processes. Spawning multiple processes isn’t scalable