Big Data 
Overview – Part 1 
Wm. Barrett Simms 
barrett@wbsimms.com 
@wbsimms
Opening remarks 
• Sponsors 
• Pluralsight 
• Free-month gift card giveaway. Enter your name in the pot! 
• DevExpress 
• $250 in JustCode developer tools. 
• O’Reilly 
• Book giveaway. Enter your name in the pot! 
• Boston Code Camp 22 (November 22nd) 
• http://www.bostoncodecamp.com/ 
• Thanks to 3thought for the space
About Me 
• Software Developer 
• Agile Team Member 
• Team Lead 
• Agile Advocate 
• SDLC Implementer 
SDLC 
[Diagram slide]
Big Data 
“Big data is an all-encompassing term for any collection of data sets so 
large and complex that it becomes difficult to process using traditional 
data processing applications.” 
- Wikipedia
The 3 Vs 
• Volume 
• From a few gigabytes up to petabytes 
• Velocity 
• Arrives quickly 
• Variety 
• Multiple Sources
Volume 
• Traditional SQL architectures don’t scale to very large data sets 
• Maybe this isn’t so true 
…but the MPP (massively parallel processing) systems are expensive
An example problem (Volume) 
• You own a chain of stores 
• … with 25,000 stores and 100,000 POS systems 
• Need information on inventory changes 
• By region 
• By store
Velocity 
• Traditional solutions don’t handle fast inbound data 
• Maybe this isn’t so true 
…but you lose data.
Another example (Velocity) 
• You host a website 
• … on 10,000 servers 
• Monitor logs for errors
Variety 
• Most traditional solutions don’t handle a variety of data types well 
• Maybe this isn’t so true 
…but you need to write a custom importer for every type.
A final example (Variety) 
• You own a business 
• … with sales and marketing teams 
• … in different regions around the world 
• Correlate sales numbers against marketing expenses
The First Problem: Computing Power 
[Diagram: each inbound request spawns its own First → Second → Third chain of processes on a single machine] 
Limited by cores 
(Scaling up)
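A minimal Python sketch, not from the deck, of why scaling up hits a ceiling: however many requests arrive, one machine can only run as many workers in parallel as it has cores. The handle_request function is a hypothetical stand-in for the per-request work.

```python
# Illustrative only: single-machine parallelism is bounded by core count.
from multiprocessing import Pool, cpu_count

def handle_request(request_id):
    # Stand-in for the "First -> Second -> Third" steps each request needs.
    return f"request {request_id} done"

if __name__ == "__main__":
    # The pool can never be wider than the cores on this one server,
    # so extra requests just queue behind the same fixed set of workers.
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(handle_request, range(100_000))
    print(len(results), "requests handled on", cpu_count(), "cores")
```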
Solution: Scale out (not up!) 
[Diagram: a Coordinator distributing work across Server 1, Server 2, Server 3, and Server 4]
Coordination 
[Diagram: a Job Coordinator dispatching work to multiple Runners]
MapReduce 
• A programming model and an associated implementation for 
processing and generating large data sets with a parallel, distributed 
algorithm on a cluster. – Wikipedia 
WHAT?
Map and Reduce 
• Map 
• Process data returning key value pairs 
• Reduce 
• Aggregate/Filter key value pairs into result 
[Diagram: Data flows into Map tasks; the mapped key/value pairs flow into Reduce, which produces the Result]
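To make the diagram concrete, here is a toy, single-process Python sketch of the map → group-by-key → reduce flow. The function names and sample records are illustrative; this shows the idea only, not Hadoop's actual API.

```python
# Toy illustration of map/reduce in one process (not Hadoop itself).
from collections import defaultdict

def map_fn(record):
    # Map: turn one input record into (key, value) pairs.
    store_id, sales = record
    yield store_id, sales

def reduce_fn(key, values):
    # Reduce: aggregate all values that share a key into one result.
    return key, max(values)

def map_reduce(records):
    grouped = defaultdict(list)
    for record in records:                  # map phase
        for key, value in map_fn(record):
            grouped[key].append(value)      # shuffle: group values by key
    return [reduce_fn(k, v) for k, v in grouped.items()]  # reduce phase

if __name__ == "__main__":
    data = [(13, 1000), (43, 12000), (21, 21000), (13, 3000)]
    print(map_reduce(data))  # e.g. [(13, 3000), (43, 12000), (21, 21000)]
```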
Mapping 
• Easy example 
• Store Sales 
• Find the highest sales total per store in 2010 
Year | Month | Store Id | SalesTotal 
2010 |     1 |       13 |      1,000 
2010 |     3 |       43 |     12,000 
2010 |     3 |       21 |     21,000 
2010 |     4 |       13 |      3,000 
2010 |     2 |       56 |      4,000 
2010 |     6 |       32 |     12,000 
2010 |     7 |        1 |      4,000 
2010 |     2 |       23 |      2,000
Solution – Map 
1. Mapper feeds document rows to your program 
2. You return key value pairs 
StoreId | Sales 
     21 |  2,000 
     23 |  3,000 
      2 |  1,000 
     21 | 23,000
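As a sketch of what that mapper could look like with Hadoop Streaming (an assumption; the deck doesn't show code), reading tab-separated Year/Month/StoreId/SalesTotal rows from standard input and emitting StoreId/Sales pairs:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: emit StoreId -> SalesTotal for 2010 rows.
# Assumes tab-separated input columns: Year, Month, StoreId, SalesTotal.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) != 4 or fields[0] != "2010":
        continue  # skip headers, malformed rows, and other years
    _year, _month, store_id, sales = fields
    print(f"{store_id}\t{sales}")  # key/value pair, tab-separated
```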
Solution - Reduce 
• Data is merged 
• Merged into Key/Values: 
{21, [2,000, 23,000]} 
{23, [3,000]} 
{2, [1,000]} 
• You process each row
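A matching reducer sketch, under the same Hadoop Streaming assumption: the framework sorts mapper output by key, so each store's sales arrive as consecutive lines and the reducer only needs a running maximum per store.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming reducer: maximum sales per store.
import sys

current_store, best = None, 0

for line in sys.stdin:
    if not line.strip():
        continue
    store_id, sales = line.strip().split("\t")
    sales = int(sales.replace(",", ""))  # tolerate "23,000"-style values
    if store_id != current_store:
        if current_store is not None:
            print(f"{current_store}\t{best}")  # finished this store's group
        current_store, best = store_id, sales
    else:
        best = max(best, sales)

if current_store is not None:
    print(f"{current_store}\t{best}")  # flush the final store
```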
Data Access 
• Each process needs access to data 
[Diagram: typical vs. desired data-access architectures]
HDFS 
• Hadoop Distributed File System 
• Open-source implementation of the Google File System (GFS) 
Hard drives last about 1,000 days, so if you have 1,000 of them, you’ll lose roughly one per day.
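That failure claim is just expected-value arithmetic; a quick back-of-the-envelope check (illustrative numbers only):

```python
# Rough expected-failure arithmetic, not a real reliability model.
mean_lifetime_days = 1_000
drive_count = 1_000
print(drive_count / mean_lifetime_days)  # 1.0 -> roughly one failed drive per day
```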
The ecosystem 
• Hive 
• SQL-like query language 
• Define and enforce schema 
• Pig 
• Data-flow scripting language (Pig Latin) 
• Sqoop 
• SQL/Hadoop integration 
• Oozie 
• Scheduling 
• Mahout 
• Machine learning library 
• Storm 
• Real-time stream processing 
… and Many Others
Vendors 
• Hortonworks 
• Single-click install of the Hortonworks Sandbox 
• Cloudera 
• Downloadable VM 
• Syncfusion 
• Single-click install of Syncfusion Big Data 
• Amazon AWS 
• Elastic MapReduce 
• Microsoft Azure 
• HDInsight
Contact Me 
Barrett Simms 
barrett@wbsimms.com 
http://wbsimms.com 
Twitter: @wbsimms 
Phone: 781.405.4686

Editor's Notes

  • #2 Welcome!
  • #4 Focus on technical product delivery
  • #14 Each inbound request spawns three processes. Spawning multiple processes isn’t scalable