Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

Can you load 20 million records into Salesforce in under an hour? If not, this webinar is for you.

You want to load tons of data into Salesforce. No problem, right? Just use the Bulk API and turn on parallel loading. Think again. Unless you carefully plan how to break a big data load into parallel operations, the load you expected to achieve maximum throughput can end up performing more like a slow, serial load.

In this webinar, Sean and Steve will teach you how to get awesome throughput from your parallel data loads on the Salesforce1 Platform. After working through the webinar's demos and code samples, you'll be able to apply your new knowledge of platform internals to measure load performance, recognize the problems that slow your loads down, and work around those roadblocks.

Key Takeaways
:: Learn what parallelism is and how much optimizing it matters for load performance
:: Learn how to architect an integration or load tool to optimize parallelism and obtain the maximum possible throughput
:: Learn how to manage locks to avoid lock exceptions that can significantly reduce the throughput in your loads and integrations

Intended Audience
:: Salesforce architects or Force.com developers with a working understanding of data loading and integration concepts. A high-level understanding of the Bulk API and Java is also useful.

Salesforce API Series: Fast Parallel Data Loading with the Bulk API Webinar

  1. 1. Salesforce API Series Fast Parallel Data Loading with the Bulk API February 26, 2014
  2. 2. Safe Harbor Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling nonsalesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. This documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements. #forcewebinar
  3. 3. Speakers Steve Bobrowski Architect Evangelist @sbob909 #forcewebinar Sean Regan Architect Evangelist @sfdcsregan
  4. 4. Follow Developer Force for the Latest News @forcedotcom / #forcewebinar Developer Force – Force.com Community +Developer Force – Force.com Community Developer Force Developer Force Group #forcewebinar
  5. 5. How fast can you load data into Salesforce?
  6. 6. How many records can you load into Salesforce in 1 hour?
  7. 7. Data load throughput [chart: records/hour, scale 5,000,000 to 25,000,000, ranging from OK to Fast to Faster] #forcewebinar
  8. 8. Parallel processing
  9. 9. A parallel processing analogy: digging a ditch #forcewebinar
  10. 10. Serial processing #forcewebinar
  11. 11. Parallel processing #forcewebinar
  12. 12. Degree of parallelism: the number of processes or threads associated with an operation.
  13. 13. Optimal parallel processing [diagram: four 5M-record loads running in parallel vs. one 20M-record serial load over time] #forcewebinar
  14. 14. Sub-optimal parallel processing [diagram: four 5M-record loads running in parallel vs. one 20M-record serial load over time] #forcewebinar
  15. 15. Throughput inhibitors: locks, exceptions, triggers, relationships, … [diagram: four 5M-record loads running in parallel vs. one 20M-record serial load over time] #forcewebinar
  16. 16. Data load case studies §  Get hands-on with the Salesforce Bulk API §  Contrast serial data loads vs. parallel data loads §  Measure degrees of parallelism and throughput §  Identify and avoid throughput inhibitors §  Achieve maximum throughput #forcewebinar
  17. 17. Prep work
  18. 18. Salesforce Bulk API §  Asynchronous data loading §  Optimized for large data sets §  REST API §  Powers many tools §  Use it to build custom tools with any programming language (Java, etc.) #forcewebinar
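
For a sense of what "build custom tools" looks like in practice, here is a minimal sketch of connecting to the Bulk API and creating a load job from Java. It assumes the Salesforce Web Service Connector (WSC) async client (com.sforce.async), a session ID you have already obtained by logging in, and placeholder values for the instance URL, API version, and object name.

```java
import com.sforce.async.BulkConnection;
import com.sforce.async.ContentType;
import com.sforce.async.JobInfo;
import com.sforce.async.OperationEnum;
import com.sforce.ws.ConnectorConfig;

public class BulkApiQuickstart {
    // Build a BulkConnection against the asynchronous (Bulk API) REST endpoint.
    static BulkConnection connect(String instanceUrl, String sessionId) throws Exception {
        ConnectorConfig config = new ConnectorConfig();
        config.setSessionId(sessionId);                                // session from a prior SOAP or OAuth login
        config.setRestEndpoint(instanceUrl + "/services/async/29.0");  // API v29.0 was current in early 2014
        return new BulkConnection(config);
    }

    public static void main(String[] args) throws Exception {
        BulkConnection bulk = connect(args[0], args[1]);

        // A job describes the load: target object, operation, and data format.
        JobInfo job = new JobInfo();
        job.setObject("Account");                 // placeholder object, not the webinar's demo schema
        job.setOperation(OperationEnum.insert);
        job.setContentType(ContentType.CSV);
        job = bulk.createJob(job);                // the server assigns the job ID
        System.out.println("Created job " + job.getId());
    }
}
```
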
  19. 19. Demo schema #forcewebinar
  20. 20. Bulk API Loads that … Realize, Investigate, and Plan
  21. 21. Case Studies
  22. 22. Serial Data Load
  23. 23. Serial load: Expected plan [diagram: batches scheduled across threads over time] •  One job •  100 batches •  10,000 records/batch •  1M total records #forcewebinar
  24. 24. Serial load: Job configuration #forcewebinar
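
Only the title of the job-configuration slide survives in this transcript. As an illustration, a serial load is simply a Bulk API job whose concurrency mode is set to Serial; a sketch assuming the WSC async client from the earlier example:

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BulkConnection;
import com.sforce.async.ConcurrencyMode;
import com.sforce.async.ContentType;
import com.sforce.async.JobInfo;
import com.sforce.async.OperationEnum;

class SerialJobConfig {
    // Create a Bulk API job whose batches are processed strictly one at a time.
    static JobInfo createSerialJob(BulkConnection bulk) throws AsyncApiException {
        JobInfo job = new JobInfo();
        job.setObject("Account");                        // placeholder; the demo uses its own schema
        job.setOperation(OperationEnum.insert);
        job.setContentType(ContentType.CSV);
        job.setConcurrencyMode(ConcurrencyMode.Serial);  // the setting that makes this load serial
        return bulk.createJob(job);
    }
}
```
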
  25. 25. Serial load: Batch creation #forcewebinar
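
Likewise, only the batch-creation slide's title is captured above. A hedged sketch of the usual pattern, splitting one large CSV file into 10,000-record batches and adding each to the job (assuming the WSC client; the file layout and batch size are placeholders):

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BatchInfo;
import com.sforce.async.BulkConnection;
import com.sforce.async.JobInfo;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

class BatchCreator {
    // Split one big CSV into fixed-size chunks and queue each chunk as a batch.
    static List<BatchInfo> addBatches(BulkConnection bulk, JobInfo job, File csv, int recordsPerBatch)
            throws IOException, AsyncApiException {
        List<BatchInfo> batches = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(csv))) {
            String header = in.readLine();                       // every batch needs the CSV header row
            StringBuilder chunk = new StringBuilder(header).append('\n');
            int rows = 0;
            String line;
            while ((line = in.readLine()) != null) {
                chunk.append(line).append('\n');
                if (++rows == recordsPerBatch) {
                    batches.add(submit(bulk, job, chunk.toString()));
                    chunk = new StringBuilder(header).append('\n');
                    rows = 0;
                }
            }
            if (rows > 0) batches.add(submit(bulk, job, chunk.toString()));
        }
        return batches;
    }

    private static BatchInfo submit(BulkConnection bulk, JobInfo job, String csvChunk)
            throws AsyncApiException {
        InputStream body = new ByteArrayInputStream(csvChunk.getBytes(StandardCharsets.UTF_8));
        return bulk.createBatchFromStream(job, body);            // each call queues one batch
    }
}
```
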
  26. 26. Serial load: Batch run #forcewebinar
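
Once every batch is queued, the job is closed and the client polls until all batches reach a terminal state. A sketch, again assuming the WSC client; the 15-second poll interval is an arbitrary choice:

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BatchInfo;
import com.sforce.async.BatchStateEnum;
import com.sforce.async.BulkConnection;
import com.sforce.async.JobInfo;

class BatchRunner {
    // Signal that all batches are queued, then wait for every batch to finish.
    static void runAndWait(BulkConnection bulk, JobInfo job) throws AsyncApiException, InterruptedException {
        bulk.closeJob(job.getId());                              // no more batches will be added
        boolean done = false;
        while (!done) {
            Thread.sleep(15_000);                                // poll interval; tune for batch size
            done = true;
            for (BatchInfo b : bulk.getBatchInfoList(job.getId()).getBatchInfo()) {
                if (b.getState() == BatchStateEnum.Queued || b.getState() == BatchStateEnum.InProgress) {
                    done = false;
                } else if (b.getState() == BatchStateEnum.Failed) {
                    System.err.println("Batch " + b.getId() + " failed: " + b.getStateMessage());
                }
            }
        }
    }
}
```
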
  27. 27. Demo Serial load
  28. 28. Serial load summary. Concurrency Mode: Serial; Records Loaded: 1 million; Records Failed: 0; Run Time: 52 minutes; Work Completed: 48 minutes; Throughput: 19,500 records per minute; Degree of Parallelism: 0.94; Key Problem: Degree of parallelism explicitly limited to ~1; Solution: Explore parallel load for increased throughput. #forcewebinar
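
A note on how the summary metrics appear to be derived (an interpretation, not stated on the slide): throughput is records loaded divided by elapsed run time, and degree of parallelism is total batch processing time ("work completed") divided by elapsed run time, so roughly 48 minutes of work spread over a 52-minute run gives a value near 1, i.e. effectively one batch at a time. A sketch that computes both from Bulk API batch statistics, assuming the WSC client and the per-batch totalProcessingTime field:

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BatchInfo;
import com.sforce.async.BulkConnection;
import com.sforce.async.JobInfo;

class LoadMetrics {
    // Derive throughput and degree of parallelism from the job's batch statistics.
    static void report(BulkConnection bulk, JobInfo job, long jobElapsedMillis) throws AsyncApiException {
        long recordsProcessed = 0;
        long batchProcessingMillis = 0;                           // "work completed" across all batches
        for (BatchInfo b : bulk.getBatchInfoList(job.getId()).getBatchInfo()) {
            recordsProcessed += b.getNumberRecordsProcessed();
            batchProcessingMillis += b.getTotalProcessingTime();
        }
        double minutes = jobElapsedMillis / 60000.0;
        System.out.printf("Throughput: %.0f records/min%n", recordsProcessed / minutes);
        // Near 1 when batches ran one at a time; approaches the number of
        // concurrently processed batches when the load is well parallelized.
        System.out.printf("Degree of parallelism: %.2f%n", (double) batchProcessingMillis / jobElapsedMillis);
    }
}
```
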
  29. 29. Parallelism vs. Throughput of a Single Job [chart: throughput in records/min vs. degree of parallelism] Serial Run •  Low degree of parallelism #forcewebinar
  30. 30. Parallel data loads
  31. 31. Parallel load: Expected plan [diagram: batches scheduled across threads over time] •  One job •  100 batches •  10,000 records/batch •  1M total records #forcewebinar
  32. 32. Parallel load: Job configuration #forcewebinar
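
Only the slide title survives here as well. The parallel configuration differs from the serial one in a single line, the concurrency mode (Parallel is also the Bulk API's default). A sketch assuming the WSC client:

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BulkConnection;
import com.sforce.async.ConcurrencyMode;
import com.sforce.async.ContentType;
import com.sforce.async.JobInfo;
import com.sforce.async.OperationEnum;

class ParallelJobConfig {
    // Create a Bulk API job whose batches the platform may process concurrently.
    static JobInfo createParallelJob(BulkConnection bulk) throws AsyncApiException {
        JobInfo job = new JobInfo();
        job.setObject("Account");                          // placeholder object
        job.setOperation(OperationEnum.insert);
        job.setContentType(ContentType.CSV);
        job.setConcurrencyMode(ConcurrencyMode.Parallel);  // batches may run on many threads at once
        return bulk.createJob(job);
    }
}
```
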
  33. 33. Things to watch for §  Locks can significantly affect parallel loads –  Wasted processing capacity –  Reduced throughput –  Failures §  Retry logic is not all it’s cracked up to be #forcewebinar
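
One practical way to see lock contention after a run like the next demo: pull each batch's result set and count rows whose error mentions UNABLE_TO_LOCK_ROW. A sketch assuming the WSC client and CSV batch results (whose rows carry an Error column for failed records):

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BatchInfo;
import com.sforce.async.BulkConnection;
import java.io.*;

class LockErrorScan {
    // Count how many result rows in a job failed with row-lock contention.
    static int countLockFailures(BulkConnection bulk, String jobId) throws AsyncApiException, IOException {
        int lockFailures = 0;
        for (BatchInfo b : bulk.getBatchInfoList(jobId).getBatchInfo()) {
            try (BufferedReader results = new BufferedReader(
                    new InputStreamReader(bulk.getBatchResultStream(jobId, b.getId())))) {
                String row;
                while ((row = results.readLine()) != null) {
                    if (row.contains("UNABLE_TO_LOCK_ROW")) {    // error text of a failed row
                        lockFailures++;
                    }
                }
            }
        }
        return lockFailures;
    }
}
```
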
  34. 34. Demo Parallel 1
  35. 35. Parallel load 1 summary. Concurrency Mode: Parallel; Records Loaded: 125,000; Records Failed: 875,000; Run Time: 10 minutes; Work Completed: 2 hours and 30 minutes; Throughput: 20,000 records per minute; Degree of Parallelism: 15.79; Key Problem: Lock exceptions; the server worked significantly harder but with no increase in throughput; Solution: Run the load in serial mode or manage locks. #forcewebinar
  36. 36. Parallelism vs. throughput of a single job [chart: throughput in records/min vs. degree of parallelism] Parallel Run 1 •  High degree of parallelism •  Low throughput due to locks #forcewebinar
  37. 37. Time to optimize §  Let’s make your data load: §  Realize –  Locks inhibit parallelism and throughput §  Investigate –  What is causing the locks §  Plan –  Manage the locks #forcewebinar
  38. 38. Demo Parallel load 2 Eliminate Locks by Modifying Schema
  39. 39. Parallel load: Sample results. Concurrency Mode: Parallel; Records Loaded: 1 million; Records Failed: 0; Run Time: 3 minutes and 30 seconds; Work Completed: 1 hour; Throughput: 320,000 records per minute; Degree of Parallelism: 19; Key Problem: None; Solution: n/a #forcewebinar
  40. 40. Parallelism vs. throughput of a single job [chart: throughput in records/min vs. degree of parallelism] Parallel Run 2 •  High degree of parallelism •  High throughput #forcewebinar
  41. 41. Locks can be managed by §  Elimination §  Ordering load file #forcewebinar
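
For the second technique, ordering the load file: the idea is to sort child records by their parent reference so that all records contending for the same parent lock land in the same batch rather than in many concurrent batches. A hypothetical sketch in plain Java (the parent-reference column index and the comma-only CSV parsing are simplifying assumptions, not the webinar's schema or tool):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

class OrderLoadFile {
    // Sort a child-record CSV by its parent reference column before batching,
    // so contention on any one parent row is confined to a single batch.
    static void sortByParent(Path input, Path output, int parentColumnIndex) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(input, StandardCharsets.UTF_8));
        String header = lines.remove(0);
        lines.sort(Comparator.comparing(row -> row.split(",", -1)[parentColumnIndex]));  // naive split; real files need a CSV parser
        List<String> out = new ArrayList<>();
        out.add(header);
        out.addAll(lines);
        Files.write(output, out, StandardCharsets.UTF_8);
    }
}
```
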
  42. 42. Demo Parallel load 3 Avoid Locks with Ordered Data
  43. 43. Managing locks … a discussion while we load §  Master-detail relationships §  Lookup relationships §  Roll-up summary fields §  Triggers §  Workflow rules §  Group membership locks* #forcewebinar
  44. 44. Parallel load: Sample results. Concurrency Mode: Parallel; Records Loaded: 1 million; Records Failed: 0; Run Time: 4 minutes; Work Completed: 1 hour; Throughput: 250,000 records per minute; Degree of Parallelism: 16.5; Key Problem: Minimal overhead due to locks; Solution: Remove all unnecessary locks #forcewebinar
  45. 45. Parallelism vs. throughput of a single job [chart: throughput in records/min vs. degree of parallelism] Parallel Run 3 •  High degree of parallelism •  High throughput #forcewebinar
  46. 46. Controlled feed/parallel data loads
  47. 47. Controlled feed load methodology §  Explicit throttling on parallelism and throughput –  Parallel extraction and loading –  Prioritization of asynchronous processing capacity §  Manage inhibitors in complex jobs –  Data Skews –  Multiple Locks #forcewebinar
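
One way a client-side tool can implement this kind of explicit throttling, as a hedged sketch: cap the number of batches that are queued or in progress at any moment, so the load never asks the platform for more parallelism than you have planned for. The maxOutstanding knob and the 10-second poll are illustrative choices, not the webinar's implementation:

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BatchInfo;
import com.sforce.async.BatchStateEnum;
import com.sforce.async.BulkConnection;
import com.sforce.async.JobInfo;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class ControlledFeed {
    // Feed batches to the job gradually, keeping at most maxOutstanding batches
    // queued or in progress at any time (an explicit throttle on parallelism).
    static void feed(BulkConnection bulk, JobInfo job, Iterator<String> csvChunks, int maxOutstanding)
            throws AsyncApiException, InterruptedException {
        while (csvChunks.hasNext()) {
            while (countOutstanding(bulk, job.getId()) >= maxOutstanding) {
                Thread.sleep(10_000);                     // wait for the platform to drain some batches
            }
            InputStream body = new ByteArrayInputStream(csvChunks.next().getBytes(StandardCharsets.UTF_8));
            bulk.createBatchFromStream(job, body);
        }
        bulk.closeJob(job.getId());
    }

    private static int countOutstanding(BulkConnection bulk, String jobId) throws AsyncApiException {
        int outstanding = 0;
        for (BatchInfo b : bulk.getBatchInfoList(jobId).getBatchInfo()) {
            if (b.getState() == BatchStateEnum.Queued || b.getState() == BatchStateEnum.InProgress) {
                outstanding++;
            }
        }
        return outstanding;
    }
}
```
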
  48. 48. Parallelism vs. throughput of a single job [chart: throughput in records/min vs. degree of parallelism] Controlled Feed Run •  Reduced parallelism •  Expected throughput #forcewebinar
  49. 49. Related wiki article and Architect Core Resources #forcewebinar
  50. 50. Recap §  Make your parallel data loads: §  Realize –  Locks inhibit parallelism and throughput §  Investigate –  What is causing the locks §  Plan –  Manage the locks #forcewebinar
  51. 51. Q&A Steve Bobrowski Architect Evangelist @sbob909 #forcewebinar Sean Regan Architect Evangelist @sfdcsregan
