Big data is a nebulous term. In addition to being the name of a band (which is now one of my Pandora stations), it also describes very, very, very large amounts of data collected in all range of disciplines, from finance to astronomy and everything in between (yes, even in marine sciences!). With the advent of technology like remote sensing and autonomous platforms, marine scientists are collecting more and more data for more and more parameters. Which is great when you’re trying to get a handle on all the possible processes in areas, say, the size of the ocean. But as with most things, there’s a flip side. With all that data comes all that data processing, and when you only have a limited number of graduate students (and only a limited number of hours in the day), it becomes difficult to actually make sense of the data gathered. We’re gathering more data than we can even comprehend. So what do we do about it?
For one, build bigger computers and smarter programs. But that takes money and time, two things most average scientists don’t have access to. So scientists have started to think outside the box. Ever heard of crowdsourcing? In this day and age, it would be hard not to have heard, but as a brief overview, it’s a way of getting something (the most common example I’ve heard of is money) from lots of different people. In terms of science, crowdsourcing refers to getting other people to complete your data analysis for you. Not only does it allow for more data to be processed, but it lets anybody from your 5th-grade science fair student to your 103-year-old retired grandmother, get involved with science.
One example of the intersection of big data, crowdsourcing, and science is Zooniverse. I just stumbled upon this website recently (which is what got me thinking about the idea of big data in the first place), and found a wealth of projects from exploring the surface of the moon to annotating diaries of World War I soldiers. The idea is to get people from all walks of life to become ‘citizen scientists’. I decided to check out ‘Plankton Portal’ to see what being a citizen scientist is all about.
The idea of Plankton Portal is to identify zooplankton species in thousands of photos taken all around the photic zone all around the world. The journey starts with a quick tutorial, guiding you through how to identify and measure these tiny organisms. After that, you can create a profile and start identifying! Wait a minute, you’re thinking, how can I become an expert zooplankton identifier after one short tutorial, and how can the ‘actual’ scientists really trust my data? They’ve addressed this by the use of statistics. Each single photograph is ‘analyzed’ by several citizen scientists and then the results essentially averaged. This way, if citizen scientist A misses something, citizen scientist B might pick it up.
This model is fantastic for getting people from all age ranges and walks of life involved in science. Which in and of itself is incredibly valuable. It gives students hands-on, real-world experience into the world of science (and the website does do a good job of explaining who, what, when, where, and why the data was collected). And of course, the scientists get their images analyzed, which does make their lives a little easier. But, you’re still left with all of this data (albeit, transformed from pixilated images to data points). What do you do with all these numbers now?
This is a question all marine scientists (including me!) struggle with. You’ve gathered all this data, but now, how do you make sense of it all? How do you know the data you collected is valid? Thankfully, there are statistical analyses that can help analyze these things (don’t ask me to go into the details, as far as I know, it’s some kind of magic). But even these start to break down when you collect data every 15 minutes for up to 10 parameters (been there, done that). At that point, it’s just a huge soup bowl full of numbers, letters, and units.
Some have even argued that big data is actually not so great for science. While a little dated, there was an article in ScienceNews (Why Big Data is Bad for Science) way back in 2013 that I think, really summed up the issues with big data. Not only is so much data overwhelming (to both the scientists and the statistical tests), but it also generates problems with correctly interpreting correlations. With so much data, statistical tests will pick-out a correlation that is really just an artifact of having so much data, essentially making correlations out of thin-air.
So be warned: we, as scientists, need to be wary. Even though we can draw upon resources that give us vast quantities of data, we need to be careful of how we use this data and how we analyze it; we need to understand where the potential biases are and the pitfalls of having large data sets. It’s a difficult problem that just highlights (like many other issues) the need for cross-discipline collaborations between scientists, computer scientists, and statisticians.
Pingback: UNderthC’s Year in Review | UNder the C